SlideShare a Scribd company logo
1 of 19
Download to read offline
Extending the Yahoo
Streaming Benchmark for Apache Apex
San Jose Apache Apex Meetup
May 4th
2016
Sandesh Hegde
sandesh@apache.org
Background
• Yahoo created a benchmark to compare Stream processing systems and
compared Storm, Flink and Spark Streaming [1]
• dataArtisans extended the benchmark by comparing Flink and Storm with
different scenarios [2]
• No benchmark comparison about Stream processing is complete without
including Apache Apex.
2
Yahoo Streaming Benchmark
Simple Advertisement Application : To see how many times an ad
campaign has been seen in an window.
• Read ads from Kafka
• Deserialize JSON string
• Filter unnecessary ads
• Projection of Fields ( remove non-essential fields )
• Join ad id with campaign id from Redis
• Windowed count per campaign and output to Redis
3
Application - with Kafka
4
Kafka Input Deserialize FilterKafka Redis OutputRedis JoinFilter Fields
Setup
• Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
• 10GigE Between compute nodes
• 4 Kafka Brokers ( 2 Partitions each & 1 Replica )
• Kafka Version : 0.8.2
• Apex ( 3.4-SNAPSHOT & 3.3 ) & Flink ( 1.0.2 )
• Yarn-Containers size: 16GB
• 1 ZooKeeper
• Message Size: 218 Bytes
• Sample Message: {"user_id":"e5e0db4b-05ea-4ac5-af7a-4bba5ed27c4c","
page_id":"80f60d0a-b02b-40e2-a667-5548a1120dda","ad_id":"
600589859","ad_type":"banner78","event_type":"purchase","event_time":"
1462374087774","ip_address":"1.2.3.4"}
5
Apex Application
6
Physical Plan
7
Quick Primer on Locality
8
• CONTAINER_LOCAL
■ Deployed in the same process, different threads
■ No serialization
■ Queue between the operators
• THREAD_LOCAL
■ Same thread
■ No serialization
■ Use it only when operators do light work
Note: [New feature] Anti Affinity is not covered here.
Benchmarking Against Previous Releases
9
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
Part of Release Certification
Application : with Kafka
10
https://github.com/sandeshh/streaming-benchmarks
Application - With Generator
11
Kafka Input Deserialize FilterKafka Redis OutputRedis JoinFilter Fields
Generator
Application - With Generator
12
https://github.com/sandeshh/streaming-benchmarks
Setup: Single Partition
State of the Art & Streaming
13
Generator Filter Redis OutputRedis JoinFilter Fields
What’s our recommendation to query the State?
In memory Key-Value store in the operators?
Application - State Store & Query
14
Generator Filter
Dimensional
Computation
Redis JoinFilter Fields Store (HDHT) QueryResult
1. Durable state ( HDHT is a key value store native to Hadoop ) [4]
2. Single System, scales with your application
3. Easy integration with external Consoles [7]
4. Low operability cost
5. Complex Dimensional Computation [5][6]
Demo
15
Q&A
16
References
17
1. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
2. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
3. https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
4. https://www.datatorrent.com/blog/data-store-for-scalable-stream-processing/
5. https://www.datatorrent.com/blog/blog-dimensions-computation-aggregate-navigator-part-1-intro/
6. https://www.datatorrent.com/blog/dimensions-computation-aggregate-navigator-part-2-
implementation/
7. http://docs.datatorrent.com/app_data_framework/
© 2016 DataTorrent
Resources
18
• Apache Apex website - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-
accelerator/
© 2016 DataTorrent
We Are Hiring
19
• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders

More Related Content

What's hot

Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
Apache Apex
 
Developing streaming applications with apache apex (strata + hadoop world)
Developing streaming applications with apache apex (strata + hadoop world)Developing streaming applications with apache apex (strata + hadoop world)
Developing streaming applications with apache apex (strata + hadoop world)
Apache Apex
 

What's hot (20)

From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
Building your first aplication using Apache Apex
Building your first aplication using Apache ApexBuilding your first aplication using Apache Apex
Building your first aplication using Apache Apex
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
 
Introduction to Apache Apex - CoDS 2016
Introduction to Apache Apex - CoDS 2016Introduction to Apache Apex - CoDS 2016
Introduction to Apache Apex - CoDS 2016
 
Apex as yarn application
Apex as yarn applicationApex as yarn application
Apex as yarn application
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
 
Java High Level Stream API
Java High Level Stream APIJava High Level Stream API
Java High Level Stream API
 
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra TagareActionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data TransformationsKafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
 
Developing streaming applications with apache apex (strata + hadoop world)
Developing streaming applications with apache apex (strata + hadoop world)Developing streaming applications with apache apex (strata + hadoop world)
Developing streaming applications with apache apex (strata + hadoop world)
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 

Viewers also liked

Viewers also liked (9)

Apache Apex Fault Tolerance and Processing Semantics
Apache Apex Fault Tolerance and Processing SemanticsApache Apex Fault Tolerance and Processing Semantics
Apache Apex Fault Tolerance and Processing Semantics
 
Extending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkExtending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming Benchmark
 
Windowing in Apache Apex
Windowing in Apache ApexWindowing in Apache Apex
Windowing in Apache Apex
 
Apache Apex Fault Tolerance and Processing Semantics
Apache Apex Fault Tolerance and Processing SemanticsApache Apex Fault Tolerance and Processing Semantics
Apache Apex Fault Tolerance and Processing Semantics
 
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas WeiseStream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas Weise
 
Deep Dive into Apache Apex App Development
Deep Dive into Apache Apex App DevelopmentDeep Dive into Apache Apex App Development
Deep Dive into Apache Apex App Development
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
 
Capital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 msCapital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 ms
 
最近のストリーム処理事情振り返り
最近のストリーム処理事情振り返り最近のストリーム処理事情振り返り
最近のストリーム処理事情振り返り
 

Similar to Extending The Yahoo Streaming Benchmark to Apache Apex

1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part i
sqlserver.co.il
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
Saltlux Inc.
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Eren Avşaroğulları
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
Testing in the Cloud using Panda
Testing in the Cloud using PandaTesting in the Cloud using Panda
Testing in the Cloud using Panda
Tao Jiang
 

Similar to Extending The Yahoo Streaming Benchmark to Apache Apex (20)

DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
ansible_rhel_90.pdf
ansible_rhel_90.pdfansible_rhel_90.pdf
ansible_rhel_90.pdf
 
23.03.2015 Minsk .NET Meetup #15: Interop Between Scala and .NET
23.03.2015 Minsk .NET Meetup #15: Interop Between Scala and .NET23.03.2015 Minsk .NET Meetup #15: Interop Between Scala and .NET
23.03.2015 Minsk .NET Meetup #15: Interop Between Scala and .NET
 
Eugene Bova "Dapr (Distributed Application Runtime) in a Microservices Archit...
Eugene Bova "Dapr (Distributed Application Runtime) in a Microservices Archit...Eugene Bova "Dapr (Distributed Application Runtime) in a Microservices Archit...
Eugene Bova "Dapr (Distributed Application Runtime) in a Microservices Archit...
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part i
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
SharePoint 2010 Boost your farm performance!
SharePoint 2010 Boost your farm performance!SharePoint 2010 Boost your farm performance!
SharePoint 2010 Boost your farm performance!
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Ml2
Ml2Ml2
Ml2
 
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and MoreWSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More
 
What's New in .Net 4.5
What's New in .Net 4.5What's New in .Net 4.5
What's New in .Net 4.5
 
Testing in the Cloud using Panda
Testing in the Cloud using PandaTesting in the Cloud using Panda
Testing in the Cloud using Panda
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
 

More from Apache Apex

More from Apache Apex (12)

Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache Apex
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexMaking sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
 
Apache Apex & Bigtop
Apache Apex & BigtopApache Apex & Bigtop
Apache Apex & Bigtop
 
Building Your First Apache Apex Application
Building Your First Apache Apex ApplicationBuilding Your First Apache Apex Application
Building Your First Apache Apex Application
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Recently uploaded (20)

ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreel
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 

Extending The Yahoo Streaming Benchmark to Apache Apex

  • 1. Extending the Yahoo Streaming Benchmark for Apache Apex San Jose Apache Apex Meetup May 4th 2016 Sandesh Hegde sandesh@apache.org
  • 2. Background • Yahoo created a benchmark to compare Stream processing systems and compared Storm, Flink and Spark Streaming [1] • dataArtisans extended the benchmark by comparing Flink and Storm with different scenarios [2] • No benchmark comparison about Stream processing is complete without including Apache Apex. 2
  • 3. Yahoo Streaming Benchmark Simple Advertisement Application : To see how many times an ad campaign has been seen in an window. • Read ads from Kafka • Deserialize JSON string • Filter unnecessary ads • Projection of Fields ( remove non-essential fields ) • Join ad id with campaign id from Redis • Windowed count per campaign and output to Redis 3
  • 4. Application - with Kafka 4 Kafka Input Deserialize FilterKafka Redis OutputRedis JoinFilter Fields
  • 5. Setup • Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz • 10GigE Between compute nodes • 4 Kafka Brokers ( 2 Partitions each & 1 Replica ) • Kafka Version : 0.8.2 • Apex ( 3.4-SNAPSHOT & 3.3 ) & Flink ( 1.0.2 ) • Yarn-Containers size: 16GB • 1 ZooKeeper • Message Size: 218 Bytes • Sample Message: {"user_id":"e5e0db4b-05ea-4ac5-af7a-4bba5ed27c4c"," page_id":"80f60d0a-b02b-40e2-a667-5548a1120dda","ad_id":" 600589859","ad_type":"banner78","event_type":"purchase","event_time":" 1462374087774","ip_address":"1.2.3.4"} 5
  • 8. Quick Primer on Locality 8 • CONTAINER_LOCAL ■ Deployed in the same process, different threads ■ No serialization ■ Queue between the operators • THREAD_LOCAL ■ Same thread ■ No serialization ■ Use it only when operators do light work Note: [New feature] Anti Affinity is not covered here.
  • 9. Benchmarking Against Previous Releases 9 https://www.datatorrent.com/blog/blog-apex-performance-benchmark/ Part of Release Certification
  • 10. Application : with Kafka 10 https://github.com/sandeshh/streaming-benchmarks
  • 11. Application - With Generator 11 Kafka Input Deserialize FilterKafka Redis OutputRedis JoinFilter Fields Generator
  • 12. Application - With Generator 12 https://github.com/sandeshh/streaming-benchmarks Setup: Single Partition
  • 13. State of the Art & Streaming 13 Generator Filter Redis OutputRedis JoinFilter Fields What’s our recommendation to query the State? In memory Key-Value store in the operators?
  • 14. Application - State Store & Query 14 Generator Filter Dimensional Computation Redis JoinFilter Fields Store (HDHT) QueryResult 1. Durable state ( HDHT is a key value store native to Hadoop ) [4] 2. Single System, scales with your application 3. Easy integration with external Consoles [7] 4. Low operability cost 5. Complex Dimensional Computation [5][6]
  • 17. References 17 1. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at 2. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ 3. https://www.datatorrent.com/blog/blog-apex-performance-benchmark/ 4. https://www.datatorrent.com/blog/data-store-for-scalable-stream-processing/ 5. https://www.datatorrent.com/blog/blog-dimensions-computation-aggregate-navigator-part-1-intro/ 6. https://www.datatorrent.com/blog/dimensions-computation-aggregate-navigator-part-2- implementation/ 7. http://docs.datatorrent.com/app_data_framework/
  • 18. © 2016 DataTorrent Resources 18 • Apache Apex website - http://apex.apache.org/ • Subscribe - http://apex.apache.org/community.html • Download - http://apex.apache.org/downloads.html • Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex • Facebook - https://www.facebook.com/ApacheApex/ • Meetup - http://www.meetup.com/topics/apache-apex • Free Enterprise License for Startups - https://www.datatorrent.com/product/startup- accelerator/
  • 19. © 2016 DataTorrent We Are Hiring 19 • jobs@datatorrent.com • Developers/Architects • QA Automation Developers • Information Developers • Build and Release • Community Leaders