Next Gen Decision Making in <2ms
2
3
VS.
4
5
6
7
[Figure: X (predictor) = spend amount; Y (response) = likelihood of millionaire; labels: Simple, Velocity, Advanced]
8
9
10
11
12
Hard Metrics – Goal
• Latency – < 40ms; ideally < 16ms
• Throughput – Goal of 2000 events / second
• Durability – No loss; every message gets exactly one response
• Availability – 99.5% uptime (downtime of 1.83 days / year); ideally 99.999% uptime (downtime of 5.26 minutes / year)
• Scalability – Can add resources and still meet latency requirements
• Integration – Transparently connected to existing systems: hardware, messaging, HDFS
Soft Metrics – Goal
• Open Source – All components licensed as open source
• Extensibility – Rules can be updated; model is regularly refreshed
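The downtime figures quoted for the availability goal follow directly from the uptime percentages. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative, not from the deck):

```python
# Convert an uptime target into the downtime budget it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def downtime_minutes_per_year(uptime_percent):
    """Return the minutes of downtime per year permitted by an uptime target."""
    return (1.0 - uptime_percent / 100.0) * MINUTES_PER_YEAR


for target in (99.5, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes/year "
          f"({minutes / (24 * 60):.3f} days/year)")
```

At 99.5% the budget is 2628 minutes (about 1.8 days) per year; at 99.999% it is about 5.26 minutes per year.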
13
14
Onyx
15
Enterprise Readiness
Roadmap
Performance
Community
16
17
18
19
20
21
22
23
24
25
26
Performance
• Avg. 0.25ms, @70k records/sec, w/ 600GB RAM
Thread Local on ~54M events (percentiles in ms)
Throughput | Count | Avg (ms) | 90% | 95% | 99% | 99.9% | 4 9’s | 5 9’s | 6 9’s
70k/sec | 54,126,122 | 0.19 | 1 | 1 | 1 | 2 | 2 | 5 | 6
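The percentile columns above (90% through "6 9’s") can be derived from raw per-event latencies. The deck does not say which percentile method was used; the sketch below assumes the nearest-rank definition, and the function names are illustrative.

```python
import math
from fractions import Fraction


def percentile(sorted_values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    # Fraction(str(pct)) keeps the rank exact for values such as 99.9.
    rank = math.ceil(Fraction(str(pct)) * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms, pcts=(90, 95, 99, 99.9, 99.99, 99.999, 99.9999)):
    """Count, average, and the requested percentiles for a batch of latencies (ms)."""
    ordered = sorted(latencies_ms)
    summary = {"count": len(ordered), "avg": sum(ordered) / len(ordered)}
    for pct in pcts:
        summary[pct] = percentile(ordered, pct)
    return summary
```

With whole-millisecond measurements, as in the table, most percentiles collapse to the same small integers, which is why the average (0.19 ms) is lower than the 90th percentile (1 ms).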
27
Durability
• Two physically independent pipelines on the same cluster processing
identical data
• For the same tuple, we find the best-case time between two pipelines
– 39 records out of 5.2M exceeded 16ms
– 173 out of 5.2M exceeded 16ms in one pipeline but succeeded in the other
• 99.99925% success rate – “Five Nines”
• Average Latency of 0.0981ms
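The best-case measurement described above can be sketched as follows: for each tuple, take the faster of the two pipelines, then count how many tuples stay within the 16ms target. The dictionary layout (tuple id mapped to latency in ms) and the function names are assumptions for illustration, not the harness used in the deck.

```python
def best_case_latencies(pipeline_a, pipeline_b):
    """For each tuple id seen by either pipeline, keep the lower latency (ms)."""
    best = {}
    for results in (pipeline_a, pipeline_b):
        for tuple_id, latency_ms in results.items():
            if tuple_id not in best or latency_ms < best[tuple_id]:
                best[tuple_id] = latency_ms
    return best


def success_rate(latencies, sla_ms=16.0):
    """Fraction of tuples whose best-case latency does not exceed the SLA."""
    within = sum(1 for latency in latencies.values() if latency <= sla_ms)
    return within / len(latencies)


# Consistency check on the quoted figure, taking 5.2M as exactly 5,200,000
# (an assumption, since the deck rounds the record count).
quoted_rate_percent = (1 - 39 / 5_200_000) * 100
```

Under that rounding assumption, 39 misses out of 5.2M records gives the 99.99925% quoted above.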
28
29
30
Appendix
31
Streaming Technologies Evaluated
• Spark Streaming
• Samza
• Storm
• Feedzai
• Infosphere Streams
• Flink
• Ignite
• VoltDB
• Cassandra
• Apex
• Of all evaluated technologies, Apache Apex is the only technology that is ready to
bring the decision making solution to production based on:
– Maturity
– Fault-tolerance
– Enterprise-readiness
– Performance
• Focus on open source
• Drive Roadmap
• Competitive Advantage for C1
32
Stream Processing – Apache Storm
• An open-source, distributed, real-time computation system
– Logical operators (spouts and bolts) form statically parallelizable topologies
– Very high throughput of messages with very low latency
– Can provide <10ms end-to-end latency under normal operation
• Basic abstractions provide an at-least-once processing guarantee
Limitations
• Nimbus is a single point of failure
– Rectified by Hortonworks, but not yet available to the public (no timeline for release)
• Upstream bolt/spout failure triggers re-compute on entire tree
– Can only create parallel independent stream by having separate redundant topologies
• Bolts/spouts share JVM → Hard to debug
• Failed tuples cannot be replayed quicker than 1s
• No dynamic topologies
• Cannot add or remove applications without service interruption
33
Stream Processing – Apache Flink
• An open-source, distributed, real-time computation system
– Logical operators are compiled into a DAG of tasks executed by Task Managers
– Supports streaming, micro-batch, batch compute
– Supports aggregate operations on streams (reduce, join, groupBy)
– Capable of <10 ms end-to-end latency with streaming under normal operation
• Can provide exactly-once processing guarantees
Limitations
• Failures trigger reset of ALL operators to last checkpoint
– Depends on upstream message broker to track state
• Operators share JVM
– Failure in one brings down all tasks sharing that JVM
– Hard to debug
• No dynamic topologies
• Young community, young product
34
Stream Processing – Apache Apex
• An open-source, distributed, real-time computation system on YARN
• Apex is the core system powering DataTorrent, released under ASF
• Demonstrated high throughput with low latency running a next-generation
C1 model (avg. 0.25ms, max 2ms, @ 70k records/sec) w/ 600GB RAM
• True YARN application developed from principles of Hadoop and YARN at
Yahoo!
• Mature product (derived from proven solutions in Yahoo! Finance and
Hadoop)
– Built by team under Phu Hoang (CEO of DataTorrent, Head of Engineering at Yahoo)
who built Hadoop
– Amol (CTO of DataTorrent) led the team that built YARN
• DataTorrent (Apex) is executing on production clusters at Fortune 100
companies.
35
Stream Processing – Apache Apex
Maturity
• Designed to process and manage global data for Yahoo! Finance
– Primary focus is on stability, fault-tolerance and data management
– Only OSS streaming technology considered designed explicitly for the financial world
• Data or computation could never be lost or replicated
• Architecture had to never go down
• Goal was to make it rock-solid and enterprise-ready before worrying about performance
• Data flow across countries – perfect for a use case that requires cross-cluster interaction
Enterprise Readiness
• Advanced support for:
– Encryption, authentication, compression, administration, and monitoring
– Deployment at scale in the cloud and on-prem – AWS, Google Cloud, Azure
• Integrates with huge set of existing tools:
– HDFS, Kafka, Cassandra, MongoDB, Redis, ElasticSearch, CouchDB, Splunk, etc.
36
Apex Platform – Summary
• Apex Architecture
– Networks of physically independent, parallelizable operators that scale dynamically
– Dynamic topology modification and deployment
– Self-healing, fault tolerant, & recoverable
• Durable messaging queues between operators, check-pointed in memory and on disk
• Resource manager is a replicated YARN process, monitors and restarts downed operators
– No single point of failure, highly modular design
– Can specify locality of execution (avoids network and inter-process latency)
• Guarantees at-least-once, at-most-once, or exactly-once processing
[Diagram: Directed Acyclic Graph (DAG) – operators connected by streams of tuples, ending in an output stream]
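The difference between the processing guarantees listed above can be illustrated with a toy replay. This is not how Apex implements them; it is a sketch, under the assumption that every tuple carries a unique id, of how at-least-once replay combined with idempotent handling yields an exactly-once effect. All names are hypothetical.

```python
class DedupingSink:
    """Sink that tolerates replays by remembering which tuple ids it has applied."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def process(self, tuple_id, amount):
        """Apply a tuple once; return False if it was a replayed duplicate."""
        if tuple_id in self.seen:
            return False
        self.seen.add(tuple_id)
        self.total += amount
        return True


def replay_from_checkpoint(sink, stream, checkpoint_index):
    """Re-send every tuple from the checkpoint position onward (at-least-once delivery)."""
    for tuple_id, amount in stream[checkpoint_index:]:
        sink.process(tuple_id, amount)
```

Without the `seen` check, a replay after failure would count some tuples twice (at-least-once); dropping the replay entirely would risk losing them (at-most-once).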
37
Apex Platform – Overview
38
Apex Platform – Malhar
39
Apex Platform – Cluster View
[Diagram: cluster view. Hadoop edge node: DT RTS Management Server with CLI and REST API. Hadoop node: YARN container running the RTS App Master. Hadoop nodes: YARN containers acting as streaming containers, each running operators (Op1, Op2, Op3) on threads Thread1 through Thread-N. Legend: Part of Community Edition.]
40
Apex Platform – Operators
• Operators can be dynamically scaled
• Flexible stream configuration
• Parallel Redis / HDHT DAGs
• Separate visualization DAG
• Parallel partitioning
• Durability of data
• Scalability
• Organization for in-memory store
• Unifiers
• Combine statistics from physical partitions
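Parallel partitioning and unifiers can be pictured with a small sketch: each physical partition keeps its own partial statistics, and a unifier merges them into one result. This is a plain-Python illustration, not the Apex operator API; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Statistics accumulated by one physical partition."""
    count: int = 0
    total_ms: float = 0.0
    max_ms: float = 0.0

    def add(self, latency_ms):
        self.count += 1
        self.total_ms += latency_ms
        self.max_ms = max(self.max_ms, latency_ms)


def partition(events, num_partitions):
    """Route each (key, latency_ms) event to a partition by hashing its key."""
    parts = [PartialStats() for _ in range(num_partitions)]
    for key, latency_ms in events:
        parts[hash(key) % num_partitions].add(latency_ms)
    return parts


def unify(parts):
    """Merge per-partition statistics into a single combined result."""
    merged = PartialStats()
    for part in parts:
        merged.count += part.count
        merged.total_ms += part.total_ms
        merged.max_ms = max(merged.max_ms, part.max_ms)
    return merged
```

Count, sum, and max merge cleanly because they are associative; a statistic such as an exact percentile would need the partitions to forward more than a single number.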
41
Dynamic Topology Modification
• Can redeploy new operators and models at run-time!
• Can reconfigure settings on the fly
42
Apex Platform – Failure Recovery
• Physical independence of partitions is critical
• Redundant STRAMs
• Configurable window size and heartbeat for low-latency recovery
• Downstream failures do not affect upstream components
– Snapshotting only depends on previous operator, not all previous operators
– Can deploy parallel DAGs with same point of origin (simpler from a hardware and
deployment perspective)
43
Apex Platform – Windowing
• Sliding window and tumbling window
• Window based on checkpoint
• No artificial latency
• Used for stats measurement
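The two window types named above differ only in whether consecutive windows overlap. A minimal sketch, assuming count-based windows over an in-memory list for simplicity (a streaming engine would normally drive windows by time or checkpoint boundaries):

```python
def tumbling_windows(events, size):
    """Consecutive, non-overlapping windows of `size` items (the last may be shorter)."""
    return [events[i:i + size] for i in range(0, len(events), size)]


def sliding_windows(events, size, step=1):
    """Overlapping full windows of `size` items, advancing `step` items each time."""
    return [events[i:i + size] for i in range(0, len(events) - size + 1, step)]


latencies_ms = [1, 2, 3, 4, 5]
print(tumbling_windows(latencies_ms, 2))
print(sliding_windows(latencies_ms, 2))
```

Each event lands in exactly one tumbling window but in up to `size` sliding windows, so sliding statistics are smoother at the cost of more computation.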
44
Enterprise Readiness
• Apex
– Great UI to monitor, debug, and control system performance
– Fault-tolerance and recovery out of the box; no additional setup or improvement needed
• YARN is still a single point of failure; a name node failure can still impact the system
– Built-in support for dynamic and automatic scaling to handle larger throughputs
– Native integration with Hadoop, YARN, and Kafka – next-gen standard at C1
– Mature product
• Apex is derived from the principles of Hadoop and YARN over the course of many years
• Built and planned by chief Hadoop architects
– Proven performance in production at Fortune 100 companies
45
Enterprise Readiness
• Storm
– Widely used but abandoned by creators at Twitter for Heron in production
• Storm debuggability - topology components are bundled in one process
• Resource demands
– Need dedicated hardware
– Can’t scale on demand or share usage
• Topology creation/tear-down is expensive, topologies can’t share cluster resources
– Have to manually isolate & de-commission machines
– Performance in failure scenarios is insufficient for this use-case
• Flink
– Operational performance has not been proven
• Only one company (ResearchGate) officially uses Flink in production
– Architecture shares fundamental limitations of Storm with regard to dynamically scaling operators & topologies and debuggability
– Performance in failure scenarios is insufficient for this use-case
46
Performance
• Storm
– Meets latency and throughput requirements only when no failures occur.
– Resilience to failures only possible by running fully independent clusters
– Difficult to debug and operationalize complex systems (due to shared JVM and poor
resource management)
• Flink
– Broader toolset than Storm or Apex – ML, batch processing, and SQL-like queries
– Meets latency and throughput requirements only when no failures occur.
– Failures reset ALL operators back to the source – resilience only possible across
clusters
– Difficult to debug and operationalize complex systems (due to shared JVM)
• Apex
– Supports redundant parallel pipelines within the same cluster
– Outstanding latency and throughput even in failure scenarios
– Self-healing independent operators (simple to isolate failures)
– Only framework to provide fine-grained control over data and compute locality
47
Roadmap – Storm
• Commercial support from Hortonworks but limited code contributions
• Twitter - Storm’s largest user - has completely abandoned Storm for Heron
• Business Continuity
– Enhance Storm’s enterprise readiness with high availability (HA) and failover to standby
clusters
– Eliminate Nimbus as a single point of failure
• Operations
– Apache Ambari support for Nimbus HA node setup
– Elastic topologies via YARN and Apache Slider.
– Incremental improvements to Storm UI to easily deploy, manage and monitor
topologies.
• Enterprise readiness
– Declarative writing of spouts, bolts, and data-sources into topologies
48
Roadmap – Flink
• Fine-grained fault tolerance (avoid rollback to data source) – Q2 2015
• SQL on Flink – Q3/Q4 2015
• Integrate with distributed memory storage – No ECD
• Use off-heap memory – Q1 2015
• Integration with Samoa, Tez, Mahout DSL – No ECD
49
Roadmap – Apex
• Roadmap for next 6 months
• Support creation of reusable pluggable modules (topologies)
• Add additional operators to connect to existing technology
– Databases
– Messaging
– Modeling systems
• Add additional SQL-like operations
– Join
– Filter
– GroupBy
– Caching
• Add ability to create cycles in graph
– Allows re-use of data for ML algorithms (similar to Spark’s caching)
50
Road Map Comparison
• Storm
– Roadmap is intended to bring Storm to enterprise readiness → Storm is not enterprise ready today according to Hortonworks
• Flink
– Roadmap brings Flink up to par with Spark and Apex, does not create new capabilities
relative to either
– Spark is more mature for batch-processing and micro-batch and Apex is more mature
from a streaming standpoint.
• Apex
– No need to improve core architecture, focus is instead on adding functionality
• Better support for ML
• Better support for wide variety of business use cases
• Better integration with existing tools
– Stated commitment to letting the community dictate direction. From incubator proposal:
• “DataTorrent plans to develop new functionality in an open, community-driven way”
51
Community
• Vendor and community involvement drive roadmap and project growth
• Storm
– Limited improvements to core components of Storm in recent months
– Limited focused and active committers
– Actively promoted and supported in public by Hortonworks
• Flink
– Some adoption in Europe, growing response in U.S.
– 11 active committers, 10 are from Data Artisans (company behind Flink)
– Community is very young, but there is substantial interest
• Apex
– Wide support network around Apex due to its evolution from Hadoop and YARN
– Young but actively growing community: http://incubator.apache.org/projects/apex.html
– Opportunity for C1 to drive growth and define the direction of this product
52
Streaming Solutions Comparison
• Apex
– Ideal for this use case, meets all performance requirements and is ready for out-of-the-box enterprise deployment
– Committer status from C1 allows us to collaboratively drive roadmap and product
evolution to fit our business need.
• Storm
– Great for many streaming use cases but not the right fit for this effort
– Performance in failure scenarios does not meet our requirements
– Community involvement is waning and there is a limited road map for substantial
product growth
• Flink
– Poised to compete with Spark in the future based on community activity and roadmap
– Not ready for enterprise deployment:
• Technical limitations around fault-tolerance and failure recovery
• Lack of broad community involvement
• Roadmap only brings it up to par with existing frameworks
53
New Capabilities Provided by Proposed Architecture
• Millisecond Level Streaming Solution
• Fault Tolerant & Highly Available
• Parallel Model Scoring for Arbitrary Number of Models
• Quick Model Generation & Execution
• Dynamic Scalability based on Latency or Throughput
• Live Model Refresh
• A/B Testing of Models in Production
• System is Self Healing upon failure of components (**)
54
Decisioning System Architecture - Strengths
• Internal
– Capital One software, running on Capital One hardware, designed by Capital One
• Open source
– Internally maintainable code
• Living Model
– Can be re-trained on current data & updated in minutes, not years
– Offline models can be expanded, re-developed, and deployed to production at will
• Extensible
– Modular architecture with swappable components
• A/B Model Testing in Production
• Dynamic Deployment / Refresh of Models
55
Hardware
MDC Hardware Specifications
• Server Quantity – 15
• Server Model – Supermicro
• CPU – Intel Xeon E5-2695v2 2.4GHz 12 cores
• Memory – 256GB
• HDD – (5) 4TB Seagate SATA
• Network Switch – Cisco Nexus 6001 10GB
• NIC – 2-port SFP+ 10GbE
MDC Software Specifications
• Hadoop – v2.6.0
• YARN – v2.6.0
• Apache Apex – v3.0
• Linux OS – RHEL v6.7
• Linux OS Kernel – 2.6.32-573.7.1.el6.x86_64
56
Performance Comparison - Redis vs. Apex-HDHT
Stats and percentiles (in ms)
Configuration | Throughput | Count | Avg (ms) | 90% | 95% | 99% | 99.9% | 4 9’s | 5 9’s | 6 9’s
Apex-HDHT, thread local, ~2M events | 70k/sec | 1,807,283 | 0.253 | 1 | 1 | 1 | 2 | 2 | 2 | 2
Apex-HDHT, thread local, ~54M events | 70k/sec | 54,126,122 | 0.19 | 1 | 1 | 1 | 2 | 2 | 5 | 6
Apex-HDHT, no locality, ~2M events | 40k/sec | 2,214,777 | 51.651 | 98 | 126 | 381 | 489 | 494 | 495 | 495
Redis, thread local, ~2M events | 8.5k/sec | 2,018,057 | 13.654 | 16 | 18 | 20 | 21 | 22 | 22 | 22
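The gap between the two thread-local ~2M-event runs can be expressed as simple ratios. The sketch below only restates figures from the results above; the dictionary keys and variable names are illustrative.

```python
# Figures copied from the thread-local ~2M-event runs above.
RESULTS = {
    "apex_hdht": {"throughput_per_sec": 70_000, "avg_ms": 0.253},
    "redis": {"throughput_per_sec": 8_500, "avg_ms": 13.654},
}

throughput_ratio = (RESULTS["apex_hdht"]["throughput_per_sec"]
                    / RESULTS["redis"]["throughput_per_sec"])
latency_ratio = RESULTS["redis"]["avg_ms"] / RESULTS["apex_hdht"]["avg_ms"]

print(f"Apex-HDHT throughput is {throughput_ratio:.1f}x Redis")
print(f"Redis average latency is {latency_ratio:.0f}x Apex-HDHT")
```

On these numbers Apex-HDHT sustains roughly 8 times the throughput at roughly 1/54 of the average latency.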

the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 

Recently uploaded (20)

Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 

Next Gen Decision Making in <2ms with Apache Apex

  • 1. Next Gen Decision Making in <2ms
  • 7. 7 [Figure: X (predictor) = spend amount; Y (response) = likelihood of millionaire; labels: Simple, Velocity, Advanced]
  • 12. 12 Hard Metrics (goal)
      Latency: < 40ms; ideally < 16ms
      Throughput: 2000 events / second
      Durability: no loss; every message gets exactly one response
      Availability: 99.5% uptime (downtime of 1.83 days / year); ideally 99.999% uptime (downtime of 5.26 minutes / year)
      Scalability: can add resources and still meet latency requirements
      Integration: transparently connected to existing systems (hardware, messaging, HDFS)
    Soft Metrics (goal)
      Open Source: all components licensed as open source
      Extensibility: rules can be updated; model is regularly refreshed
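
The downtime figures quoted for the availability goals on slide 12 follow directly from the uptime percentages. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative and not from the deck):

```python
# Downtime per year implied by an uptime target.
# Assumption: a 365-day year; the helper name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year allowed by an uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


for target in (99.5, 99.999):
    minutes = downtime_minutes_per_year(target)
    days = minutes / (24 * 60)
    print(f"{target}% uptime allows about {minutes:.3f} minutes ({days:.4f} days) per year")
```

For 99.5% this works out to roughly 1.8 days per year, and for 99.999% to a little over five minutes, in line with the slide.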
  • 26. 26 Performance
      • Avg. 0.25ms, @70k records/sec, w/ 600GB RAM
      Thread Local on ~54M events (percentiles in ms)
      Throughput  Count       Avg (ms)  90%  95%  99%  99.9%  4 9’s  5 9’s  6 9’s
      70k/sec     54,126,122  0.19      1    1    1    2      2      5      6
  • 27. 27 Durability • Two physically independent pipelines on the same cluster processing identical data • For the same tuple, we find the best-case time between two pipelines – 39 records out of 5.2M exceeded 16ms – 173 out of 5.2M exceeded 16ms in one pipeline but succeeded in the other • 99.99925% success rate – “Five Nines” • Average Latency of 0.0981ms
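
The "best-case time between two pipelines" measurement on slide 27 can be sketched as follows: for each tuple, take the faster of the two pipeline latencies and count it as a miss only if even that exceeds the 16ms target. The sample latencies and function names below are made up for illustration; only the 39-out-of-5.2M figure comes from the slide.

```python
# Sketch of the best-of-two-pipelines success rate.
# Assumptions: the sample latencies are invented; names are hypothetical.
SLA_MS = 16.0


def best_case_latency(pipeline_a_ms: float, pipeline_b_ms: float) -> float:
    """A tuple is answered as soon as either pipeline responds."""
    return min(pipeline_a_ms, pipeline_b_ms)


def success_rate_percent(total: int, misses: int) -> float:
    """Percentage of tuples whose best-case latency met the target."""
    return 100 * (1 - misses / total)


# Tiny illustrative sample: (latency in pipeline A, latency in pipeline B).
sample = [(0.1, 0.2), (20.0, 0.3), (18.0, 17.5)]
sample_misses = sum(1 for a, b in sample if best_case_latency(a, b) > SLA_MS)
print(f"sample misses: {sample_misses} of {len(sample)}")

# Figures from the slide: 39 of 5.2M tuples missed in both pipelines.
print(f"success rate: {success_rate_percent(5_200_000, 39):.5f}%")
```

In the sample, the second tuple misses in one pipeline but is rescued by the other, which is the situation the slide reports for 173 tuples.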
  • 31. 31 Streaming Technologies Evaluated • Spark Streaming • Samza • Storm • Feedzai • Infosphere Streams • Flink • Ignite • VoltDB • Cassandra • Apex • Of all evaluated technologies, Apache Apex is the only technology that is ready to bring the decision making solution to production based on: – Maturity – Fault-tolerance – Enterprise-readiness – Performance • Focus on open source • Drive Roadmap • Competitive Advantage for C1
  • 32. 32 Stream Processing – Apache Storm • An open-source, distributed, real-time computation system – Logical operators (spouts and bolts) form statically parallelizable topologies – Very high throughput of messages with very low latency – Can provide <10ms end-to-end latency under normal operation • Basic abstractions provide an at-least-once processing guarantee Limitations: • Nimbus is a single point of failure – Rectified by Hortonworks, but not yet available to the public (no timeline for release) • Upstream bolt/spout failure triggers re-compute on entire tree – Can only create parallel independent streams by having separate redundant topologies • Bolts/spouts share JVM → Hard to debug • Failed tuples cannot be replayed quicker than 1s • No dynamic topologies • Cannot add or remove applications without service interruption
  • 33. 33 Stream Processing – Apache Flink • An open-source, distributed, real-time computation system – Logical operators are compiled into a DAG of tasks executed by Task Managers – Supports streaming, micro-batch, batch compute – Supports aggregate operations on streams (reduce, join, groupBy) – Capable of <10 ms end-to-end latency with streaming under normal operation • Can provide exactly-once processing guarantees Limitations: • Failures trigger reset of ALL operators to last checkpoint – Depends on upstream message broker to track state • Operators share JVM – Failure in one brings down all tasks sharing that JVM – Hard to debug • No dynamic topologies • Young community, young product
  • 34. 34 Stream Processing – Apache Apex • An open-source, distributed, real-time computation system on YARN • Apex is the core system powering DataTorrent, released under ASF • Demonstrated high throughput with low latency running a next-generation C1 model (avg. 0.25ms, max 2ms, @ 70k records/sec) w/ 600GB RAM • True YARN application developed from principles of Hadoop and YARN at Yahoo! • Mature product (derived from proven solutions in Yahoo! Finance and Hadoop) – Built by team under Phu Hoang (CEO of DataTorrent, Head of Engineering at Yahoo) who built Hadoop – Amol (CTO of DataTorrent) led the team that built YARN • DataTorrent (Apex) is executing on production clusters at Fortune 100 companies.
  • 35. 35 Stream Processing – Apache Apex Maturity: • Designed to process and manage global data for Yahoo! Finance – Primary focus is on stability, fault-tolerance and data management – Only OSS streaming technology considered designed explicitly for the financial world • Data or computation could never be lost or replicated • Architecture had to never go down • Goal was to make it rock-solid and enterprise-ready before worrying about performance • Data flow across countries – perfect for use cases that require cross-cluster interaction Enterprise Readiness: • Advanced support for: – Encryption, authentication, compression, administration, and monitoring – Deployment at scale in the cloud and on-prem – AWS, Google Cloud, Azure • Integrates with huge set of existing tools: – HDFS, Kafka, Cassandra, MongoDB, Redis, ElasticSearch, CouchDB, Splunk, etc.
  • 36. 36 Apex Platform – Summary • Apex Architecture – Networks of physically independent, parallelizable operators that scale dynamically – Dynamic topology modification and deployment – Self-healing, fault tolerant, & recoverable • Durable messaging queues between operators, check-pointed in memory and on disk • Resource manager is a replicated YARN process, monitors and restarts downed operators – No single point of failure, highly modular design – Can specify locality of execution (avoids network and inter-process latency) • Guarantees at-least-once, at-most-once, or exactly-once processing [Figure: Directed Acyclic Graph (DAG) of operators passing tuples along streams to an output stream]
  • 39. 39 Apex Platform – Cluster View [Diagram labels: Hadoop Edge Node with DT RTS Management Server (CLI, REST API); Hadoop Node with a YARN Container running the RTS App Master; Hadoop Nodes with YARN Containers, each a Streaming Container running operators (Op1, Op2, Op3) on threads Thread1 to Thread-N; "Part of Community Edition"]
  • 40. 40 Apex Platform – Operators • Operators can be dynamically scaled • Flexible stream configuration • Parallel Redis / HDHT DAGs • Separate visualization DAG • Parallel partitioning • Durability of data • Scalability • Organization for in-memory store • Unifiers • Combine statistics from physical partitions
  • 41. 41 Dynamic Topology Modification • Can redeploy new operators and models at run-time! • Can reconfigure settings on the fly
  • 42. 42 Apex Platform – Failure Recovery • Physical independence of partitions is critical • Redundant STRAMs • Configurable window size and heartbeat for low-latency recovery • Downstream failures do not affect upstream components – Snapshotting only depends on previous operator, not all previous operators – Can deploy parallel DAGs with same point of origin (simpler from a hardware and deployment perspective)
  • 43. 43 Apex Platform – Windowing • Sliding window and tumbling window • Window based on checkpoint • No artificial latency • Used for stats measurement
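
The difference between the tumbling and sliding windows named on slide 43 can be illustrated with a small generic sketch. This is not Apex's windowing API; it uses count-based windows over an in-memory list, and the latency values are invented, purely to show how per-window statistics differ between the two schemes.

```python
# Generic illustration of tumbling vs. sliding windows (not the Apex API).
# Assumptions: count-based windows over a list; sample values are made up.
from statistics import mean
from typing import Iterator, List, Sequence


def tumbling_windows(values: Sequence[float], size: int) -> Iterator[List[float]]:
    """Non-overlapping windows: each value belongs to exactly one window."""
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def sliding_windows(values: Sequence[float], size: int, step: int = 1) -> Iterator[List[float]]:
    """Overlapping windows: the window advances by `step` values each time."""
    for start in range(0, len(values) - size + 1, step):
        yield list(values[start:start + size])


latencies_ms = [0.2, 0.3, 0.1, 2.0, 0.2, 0.4]
tumbling_avgs = [mean(w) for w in tumbling_windows(latencies_ms, 3)]
sliding_maxes = [max(w) for w in sliding_windows(latencies_ms, 3)]
print("tumbling averages:", tumbling_avgs)
print("sliding maxima:", sliding_maxes)
```

A single slow event (2.0 ms here) affects one tumbling window but several consecutive sliding windows, which matters when windows are used for stats measurement.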
  • 44. 44 Enterprise Readiness • Apex – Great UI to monitor, debug, and control system performance – Fault-tolerance and recovery out of the box - no additional setup or improvement needed • YARN is still a single point of failure; a name node failure can still impact the system – Built-in support for dynamic and automatic scaling to handle larger throughputs – Native integration with Hadoop, YARN, and Kafka – next-gen standard at C1 – Mature product • Apex is derived from the principles of Hadoop and YARN over the course of many years • Built and planned by chief Hadoop architects – Proven performance in production at Fortune 100 companies
  • 45. 45 Enterprise Readiness • Storm – Widely used but abandoned by creators at Twitter for Heron in production • Storm debuggability - topology components are bundled in one process • Resource demands – Need dedicated hardware – Can’t scale on demand or share usage • Topology creation/tear-down is expensive, topologies can’t share cluster resources – Have to manually isolate & de-commission machines – Performance in failure scenarios is insufficient for this use case • Flink – Operational performance has not been proven • Only one company (ResearchGate) officially uses Flink in production – Architecture shares fundamental limitations of Storm with regard to dynamically scaling operators & topologies and debuggability – Performance in failure scenarios is insufficient for this use case
  • 46. 46 Performance • Storm – Meets latency and throughput requirements only when no failures occur. – Resilience to failures only possible by running fully independent clusters – Difficult to debug and operationalize complex systems (due to shared JVM and poor resource management) • Flink – Broader toolset than Storm or Apex – ML, batch processing, and SQL-like queries – Meets latency and throughput requirements only when no failures occur. – Failures reset ALL operators back to the source – resilience only possible across clusters – Difficult to debug and operationalize complex systems (due to shared JVM) • Apex – Supports redundant parallel pipelines within the same cluster – Outstanding latency and throughput even in failure scenarios – Self-healing independent operators (simple to isolate failures) – Only framework to provide fine-grained control over data and compute locality
  • 47. 47 Roadmap – Storm • Commercial support from Hortonworks but limited code contributions • Twitter - Storm’s largest user - has completely abandoned Storm for Heron • Business Continuity – Enhance Storm’s enterprise readiness with high availability (HA) and failover to standby clusters – Eliminate Nimbus as a single point of failure • Operations – Apache Ambari support for Nimbus HA node setup – Elastic topologies via YARN and Apache Slider – Incremental improvements to Storm UI to easily deploy, manage and monitor topologies • Enterprise readiness – Declarative writing of spouts, bolts, and data-sources into topologies
  • 48. 48 Roadmap – Flink • Fine-grained fault tolerance (avoid rollback to data source) – Q2 2015 • SQL on Flink – Q3/Q4 2015 • Integrate with distributed memory storage – No ECD • Use off-heap memory – Q1 2015 • Integration with Samoa, Tez, Mahout DSL – No ECD
  • 49. 49 Roadmap – Apex • Roadmap for next 6 months • Support creation of reusable pluggable modules (topologies) • Add additional operators to connect to existing technology – Databases – Messaging – Modeling systems • Add additional SQL-like operations – Join – Filter – GroupBy – Caching • Add ability to create cycles in graph – Allows re-use of data for ML algorithms (similar to Spark’s caching)
  • 50. 50 Road Map Comparison • Storm – Roadmap is intended to bring Storm to enterprise readiness → Storm is not enterprise ready today according to Hortonworks • Flink – Roadmap brings Flink up to par with Spark and Apex, does not create new capabilities relative to either – Spark is more mature for batch-processing and micro-batch and Apex is more mature from a streaming standpoint. • Apex – No need to improve core architecture, focus is instead on adding functionality • Better support for ML • Better support for wide variety of business use cases • Better integration with existing tools – Stated commitment to letting the community dictate direction. From incubator proposal: • “DataTorrent plans to develop new functionality in an open, community-driven way”
  • 51. 51 Community • Vendor and community involvement drive roadmap and project growth • Storm – Limited improvements to core components of Storm in recent months – Limited focused and active committers – Actively promoted and supported in public by Hortonworks • Flink – Some adoption in Europe, growing response in U.S. – 11 active committers, 10 are from Data Artisans (company behind Flink) – Community is very young, but there is substantial interest • Apex – Wide support network around Apex due to its evolution from Hadoop and YARN – Young but actively growing community: http://incubator.apache.org/projects/apex.html – Opportunity for C1 to drive growth and define the direction of this product
  • 52. 52 Streaming Solutions Comparison • Apex – Ideal for this use case, meets all performance requirements and is ready for out-of-the- box enterprise deployment – Committer status from C1 allows us to collaboratively drive roadmap and product evolution to fit our business need. • Storm – Great for many streaming use cases but not the right fit for this effort – Performance in failure scenarios does not meet our requirements – Community involvement is waning and there is a limited road map for substantial product growth • Flink – Poised to compete with Spark in the future based on community activity and roadmap – Not ready for enterprise deployment: • Technical limitations around fault-tolerance and failure recovery • Lack of broad community involvement • Roadmap only brings it up to par with existing frameworks
  • 53. 53 New Capabilities Provided by Proposed Architecture • Millisecond Level Streaming Solution • Fault Tolerant & Highly Available • Parallel Model Scoring for Arbitrary Number of Models • Quick Model Generation & Execution • Dynamic Scalability based on Latency or Throughput • Live Model Refresh • A/B Testing of Models in Production • System is Self Healing upon failure of components (**)
  • 54. 54 Decisioning System Architecture - Strengths • Internal – Capital One software, running on Capital One hardware, designed by Capital One • Open source – Internally maintainable code • Living Model – Can be re-trained on current data & updated in minutes, not years – Offline models can be expanded, re-developed, and deployed to production at will • Extensible – Modular architecture with swappable components • A/B Model Testing in Production • Dynamic Deployment / Refresh of Models
  • 55. 55 Hardware
      MDC Hardware Specifications
        • Server Quantity – 15
        • Server Model – Supermicro
        • CPU – Intel Xeon E5-2695v2, 2.4GHz, 12 cores
        • Memory – 256GB
        • HDD – (5) 4TB Seagate SATA
        • Network Switch – Cisco Nexus 6001 10GB
        • NIC – 2-port SFP+ 10GbE
      MDC Software Specifications
        • Hadoop – v2.6.0
        • YARN – v2.6.0
        • Apache Apex – v3.0
        • Linux OS – RHEL v6.7
        • Linux OS Kernel – 2.6.32-573.7.1.el6.x86_64
  • 56. 56 Performance Comparison - Redis vs. Apex-HDHT (percentiles in ms)
      Configuration                          Throughput  Count       Avg (ms)  90%  95%  99%  99.9%  4 9’s  5 9’s  6 9’s
      Apex-HDHT, thread local, ~2M events    70k/sec     1,807,283   0.253     1    1    1    2      2      2      2
      Apex-HDHT, thread local, ~54M events   70k/sec     54,126,122  0.19      1    1    1    2      2      5      6
      Apex-HDHT, no locality, ~2M events     40k/sec     2,214,777   51.651    98   126  381  489    494    495    495
      Redis, thread local, ~2M events        8.5k/sec    2,018,057   13.654    16   18   20   21     22     22     22
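
The slides do not say how the latency percentiles on slide 56 were computed. As one common possibility, the sketch below uses the nearest-rank method on a sorted list of per-event latencies; the function name and sample data are illustrative assumptions, not the deck's measurement code.

```python
# Nearest-rank percentile over recorded latencies.
# Assumptions: the deck's actual method is unknown; data below is made up.
import math
from typing import Sequence


def nearest_rank_percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Smallest value such that at least pct% of the values are <= it."""
    if not sorted_values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Latencies rounded to whole milliseconds, sorted ascending.
latencies_ms = sorted([1, 1, 1, 1, 1, 1, 1, 1, 2, 6])
for pct in (50, 90, 100):
    print(f"p{pct}: {nearest_rank_percentile(latencies_ms, pct)} ms")
```

Tail percentiles such as "5 9’s" only become meaningful with enough events: with about 2M events, the 99.9999th percentile is determined by the two or three slowest records.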

Editor's Notes

  1. Spark Streaming is missing: a nonstarter due to micro-batching and lack of dynamic DAG reconfiguration
  2. Fast; easy to use; mature. Limitations: failures are not independent, Nimbus, no dynamic topologies, 1 sec ack, resource usage. Community stagnating, only Hortonworks. Roadmap still to bring it to enterprise readiness (integration with YARN, elastic topologies, high availability).
  3. Easy to use; fast; support for SQL-like queries. Meeting notes (10/27/15 10:14): reset to upstream data source, shared JVM, no dynamic topologies. Young community. Roadmap: fine-grained fault tolerance, in-memory store integration, off-heap memory, full SQL.
  4. Veterans from Yahoo! Finance and Hadoop. Built for enterprise stability and durability before performance. Phu Hoang: CEO and co-founder, head of engineering at Yahoo. Amol Kekre: led Yahoo! Finance and led YARN. Chetan: lead architect from Yahoo Finance. Thomas Weise: Hadoop veteran from Yahoo.
  5. Dynamic topologies Downstream components do not affect upstream
  6. Fine grained control of locality No single point of failure Operators are independent
  7. Independence of partitions Auto-scaling (throughput and latency)
  8. Batch, micro-batch, and true streaming
  9. Owner: Dongming
  10. Owner: Dongming