Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Low latency high throughput streaming using
Apache Apex and Apache Kudu
Dataworks Summit 2017
Ananth Gundabattula
Hypothetical business case
Introduction Apex Kudu Q&A
2
• IOT style enabled end user devices
• Click Streams
• Smart phone...
Introduction Apex Kudu Q&A
3
• Activity recognition stream is processed by a machine learning model to detect human activi...
Introduction Apex Kudu Q&A
4
Apache Apex introduction
Low latency Distributed Streaming
Enterprise grade features
- Highly...
Introduction Apex Kudu Q&A
5
Apache Kudu introduction
Tabular structure Distributed
Low latency
random access
Interesting ...
Introduction Apex Kudu Q&A
6
Apex as a streaming engine
<100 ms
Fraud vs No Fraud
Time dimension
Logical unit 1 Logical un...
Introduction Apex Kudu Q&A
Apex - distributed engine
Scale up & Scale out
HAR is too much
Data
Time dimension
Logical unit...
Introduction Apex Kudu Q&A
Apex - application layout
RESTAPI
Bare Metal/ Cloud/Virtual
Hadoop - HDFS + YARN
All major dist...
Introduction Apex Kudu Q&A
Apex application development model
•Stream is a sequence of data tuples
• An Operator consumes ...
Introduction Apex Kudu Q&A
Apex application deployment model
• Non-intrusive model to meet overall SLAs
• Different operat...
Introduction Apex Kudu Q&A
Unifiers
Functionality specific
scaling is causing
backpressure on
downstream operators
KI CE M...
Introduction Apex Kudu Q&A
Parallel partitioning
I want to avoid shuffles
for the lowest cross
operator latencies
KI CE MS...
Introduction Apex Kudu Q&A
Dynamic partitioning
• Utilise hardware for nightly batch
compute needs
• Ex: Zip code based av...
Introduction Apex Kudu Q&A
Pub sub
• High performant in-memory pub-sub messaging
• Provides ordering & idempotency for fai...
Introduction Apex Kudu Q&A
Checkpointing
• Non-intrusive streaming window markers
• In-memory processing of data & checkpo...
Introduction Apex Kudu Q&A
Processing guarantees
But machines fail /
Application
needs upgrade B T T T T T T T T T T TB B ...
Introduction Apex Kudu Q&A
Kudu output operator - Exactly once
I want exactly once
semantics but Kudu does not
support tra...
Introduction Apex Kudu Q&A
Apex Command line client
• Apex Command line client provides capabilities for
• Launching an ap...
Introduction Apex Kudu Q&A
Apex Kudu Integration
• Kudu Input Operator
• Scans a single table using a SQL expression using...
Introduction Apex Kudu Q&A
Kudu Output Operator
• Same POJO mapping to multiple
tables
•No extra transformation required
•...
Introduction Apex Kudu Q&A
Kudu output operator Autometrics
•Apex engine allows for metrics collection and monitoring
• Te...
Introduction Apex Kudu Q&A
Kudu Input Operator
• Scans a single kudu table
• Streams one row as POJO tuple to downstream O...
Introduction Apex Kudu Q&A
Kudu Input Operator Partitioning options
I want to optimise the
second application basing on
th...
Introduction Apex Kudu Q&A
Kudu Input Operator Fault tolerance
Can I make use of Kudu
replication to account for
HA of inp...
Introduction Apex Kudu Q&A
Kudu Input Operator Scan ordering
• Simple configuration switch to choose
between random order &...
Introduction Apex Kudu Q&A
Kudu Input Operator Control tuples
• Apex allows for control tuples ( user defined watermarks ) ...
Introduction Apex Kudu Q&A
Kudu input operator extensibility Time travel operator
•As part of SQL expression allows for se...
Introduction Apex Kudu Q&A
Production References
• GE prefix platform processes IOT streaming data for analytics at sub-mil...
Introduction Apex Kudu Q&A
Q & A
• Apex Community http://apex.apache.org/
community.html
• Docs http://apex.apache.org/doc...
Upcoming SlideShare
Loading in …5
×

Low latency high throughput streaming using Apache Apex and Apache Kudu

1,830 views

Published on

True streaming is fast becoming a necessity for many business use cases. On the other hand the data set sizes and volumes are also growing exponentially compounding the complexity of data processing pipelines.There exists a need for true low latency streaming coupled with very high throughput data processing. Apache Apex as a low latency and high throughput data processing framework and Apache Kudu as a high throughput store form a nice combination which solves this pattern very efficiently.

This session will walk through a use case which involves writing a high throughput stream using Apache Kafka,Apache Apex and Apache Kudu. The session will start with a general overview of Apache Apex and capabilities of Apex that form the foundation for a low latency and high throughput engine with Apache kafka being an example input source of streams. Subsequently we walk through Kudu integration with Apex by walking through various patterns like end to end exactly once, selective column writes and timestamp propagations for out of band data. The session will also cover additional patterns that this integration will cover for enterprise level data processing pipelines.

The session will conclude with some metrics for latency and throughput numbers for the use case that is presented.

Speaker
Ananth Gundabattula, Senior Architect, Commonwealth Bank of Australia

Published in: Technology

Low latency high throughput streaming using Apache Apex and Apache Kudu

  1. 1. Low latency high throughput streaming using Apache Apex and Apache Kudu Dataworks Summit 2017 Ananth Gundabattula
  2. 2. Hypothetical business case Introduction Apex Kudu Q&A 2 • IOT style enabled end user devices • Click Streams • Smart phones - Accelerometer, Gyroscope • Better fraud rules • Behavioural analytics by ingesting Human activity recognition feeds • Did the user really travel to another Geo ? • Does the user generally drive around this area ?
  3. 3. Introduction Apex Kudu Q&A 3 • Activity recognition stream is processed by a machine learning model to detect human activity • Data needs to be processed well within a lower end of double digit millisecond time frames • Data needs to be available for querying within a few milliseconds for operational analytics Solution Goals
  4. 4. Introduction Apex Kudu Q&A 4 Apache Apex introduction Low latency Distributed Streaming Enterprise grade features - Highly customisable DAG - Checkpointing - End to End Exactly once - Hadoop Security compatible - YARN enabled 1. 2. 3. 4.
  5. 5. Introduction Apex Kudu Q&A 5 Apache Kudu introduction Tabular structure Distributed Low latency random access Interesting features - Auto compaction - Mutable - Columnar optimised storage format - Fault tolerant - Hadoop ecosystem citizen 1. 2. 3. 4.
  6. 6. Introduction Apex Kudu Q&A 6 Apex as a streaming engine <100 ms Fraud vs No Fraud Time dimension Logical unit 1 Logical unit 2 Logical unit 3 VS Time dimension Logical unit 1 Logical unit 2 Logical unit 3
  7. 7. Introduction Apex Kudu Q&A Apex - distributed engine Scale up & Scale out HAR is too much Data Time dimension Logical unit 1a Logical unit 2a Logical unit 3 Time dimension Logical unit 1b Logical unit 2b Logical unit 3 Time dimension Logical unit 1c Logical unit 2c Logical unit 3 •YARN enabled • Resource Managed • MESOS support on the roadmap
  8. 8. Introduction Apex Kudu Q&A Apex - application layout RESTAPI Bare Metal/ Cloud/Virtual Hadoop - HDFS + YARN All major distributions supported Apex Streaming Runtime High performance,Fault Tolerant,In-memory App nApp 2App 1 … APEX STACK APEX-CLI Command line client YARN RESOURCE MANAGER YARN NM M C C YARN NM C C M YARN NM C M C Submit Job/ Manipulate Highly Available App Master (M) - Apex AM Highly available compute units HDFS for Persistent Storage Capitaliseonexisting Hadoopinvestments Varied compute profiles Colocate YARN traditional MR along with Apex Applications - No Disruption
  9. 9. Introduction Apex Kudu Q&A Apex application development model •Stream is a sequence of data tuples • An Operator consumes one or more input streams , processes tuples using custom business logic and emits to one or more output streams • DAG is made up of operators and streams • Rich collection of operators available from Apache Malhar • NOSQL - Cassandra, Geode .. • Kudu •Relational - JDBC • Messaging - Kafka,JMS , Solace • File Systems - HDFS , S3, NFS • Nifi •…. Kafka Input Kafka Input Kafka Input Kudu Output Kudu output Cassandra Enrich Cassandra Enrich Cassandra Enrich Model Scoring Model Scoring Model Scoring Model Scoring Model Scoring Model Scoring
  10. 10. Introduction Apex Kudu Q&A Apex application deployment model • Non-intrusive model to meet overall SLAs • Different operators can be configured independently to meet SLA needs. • Compute intensive vs I/O intensive • Custom stream codecs enable configurable tuple routing patterns KI CE MS KO Logical model KI KI KI KO KO CE CE CE MS MS MS MS MS MS Each operator needs its own scaling to meet SLAs Deployment configuration
  11. 11. Introduction Apex Kudu Q&A Unifiers Functionality specific scaling is causing backpressure on downstream operators KI CE MS KO CE CE CE CE CE MS MS U CE CE CE CE CE MS MS U U Logical Plan Bottleneck @Unifier Scaled up unifiers
  12. 12. Introduction Apex Kudu Q&A Parallel partitioning I want to avoid shuffles for the lowest cross operator latencies KI CE MS KOCO Logical Plan CEKI MS CEKI MS CEKI MS CO KO Parallel Partitioning
  13. 13. Introduction Apex Kudu Q&A Dynamic partitioning • Utilise hardware for nightly batch compute needs • Ex: Zip code based average driving speeds for HAR Features But most of the activity feeds are only during day time KI CE MS KOCO Logical Plan KI CEKI MS KI CO KO Dynamic Partitioning CEKI MS CEKI MS CEKI MS CO KO Daytime topology Nighttime topology
  14. 14. Introduction Apex Kudu Q&A Pub sub • High performant in-memory pub-sub messaging • Provides ordering & idempotency for failure scenarios • Buffers tuples in memory until the tuples are committed • Spills to disk in back-pressure scenarios I want a loosely coupled operator binding for throughput handling, recovery … T T T T T T Cassandra Operator T T T T T T Model Scoring CEKI MS MS X X CEKI MS MS Recoverability in parts of DAG
  15. 15. Introduction Apex Kudu Q&A Checkpointing • Non-intrusive streaming window markers • In-memory processing of data & checkpointing at streaming window boundaries • Configurable checkpoint store - HDFS backed store replicated & highly available • One or multiple windows ( configurable ) make a checkpoint boundary • Persist non-transient & Operator specific checkpointing data structure - Ex: Kafka : (C,T,P,O) CHECKPOINTS But machines fail / Application needs upgrade T T T T T T T T T T T T T T TB B B B BE E E E ECP CP R R Streaming Window CP Checkpoint state B Begin Streaming Window E End Streaming Window
  16. 16. Introduction Apex Kudu Q&A Processing guarantees But machines fail / Application needs upgrade B T T T T T T T T T T TB B BE E ECP R X At Least Once T T T T T T T T T T TB B BE E ECP R X At most Once T T T BE To downstream Operator To downstream Operator Exactly Once = At least once + Idempotency ( Pub-Sub ) + Operator logic
  17. 17. Introduction Apex Kudu Q&A Kudu output operator - Exactly once I want exactly once semantics but Kudu does not support transactions Exactly Once = At least once + Idempotency ( Pub-Sub ) + Operator logic T T T T T T T T T T TB B BE E ECP R X Exactly Once - Upstream window processing view T T TBET T T T Safe Mode Window/s Reconciling Mode Normal Mode Automatically Skipped Business Logic callback to Exactly detect already written records
  18. 18. Introduction Apex Kudu Q&A Apex Command line client • Apex Command line client provides capabilities for • Launching an apex application on the cluster • Specifying configuration files and properties • Managing lifecycle of an application - Kill, shutdown • Change the logical plan of the running application • We can add a new R operator with different configurations as a champion challenger • Control operator properties at runtime • Ex: Change the throttle config in the Kudu Input operator • No downtime !! I want to fine tune the champion challenger scoring model at runtime as an experiment KuduI R CO Deployed application KuduI R CO Experimentation mode application R CO
  19. 19. Introduction Apex Kudu Q&A Apex Kudu Integration • Kudu Input Operator • Scans a single table using a SQL expression using a distributed scan approach • ANTLR4 parser compensates for the missing JDBC driver for Kudu. • Kudu Output Operator • Used to mutate a single table basing on the context. Supports • Insert • Update • Upsert • Delete • Available post 3.8.0 release of Malhar I want to integrate Kudu
  20. 20. Introduction Apex Kudu Q&A Kudu Output Operator • Same POJO mapping to multiple tables •No extra transformation required • Automatic schema detection • Override Column name mapping if required Single Kafka payload Message translates to Device and Activity tables CEKI MS CEKI MS CEKI MS KO1 KO2 DeviceID First Seen LastSeen LastKnownGeo • Can choose to write only a subset of the column • Ex: LastSeen can be updated without reading FirstSeen Not all columns of the HAR device data is sent all of the time
  21. 21. Introduction Apex Kudu Q&A Kudu output operator Autometrics •Apex engine allows for metrics collection and monitoring • Termed as Autometrics • Metrics are automatically aggregated over the entire instances of the operator • Supports complex types as a metric construct • Metrics are also available as a RESTAPI endpoint. • Metrics supported by the Kudu output operator • On a per window basis • Inserts,updates,upserts,deletes, bytes written, write operations, write RPCs, RPC errors, Operational errors • On a global basis ( i.e. from start of application ) • Same as above I want to monitor kudu Operational metrics
  22. 22. Introduction Apex Kudu Q&A Kudu Input Operator • Scans a single kudu table • Streams one row as POJO tuple to downstream Operators • Accepts a SQL expression to determine the rows that need to be read • The query processing is distributed across • All Apex Operators that divide the stream work equally • Disruptor Queue for maximum throughput I want to scan and stream data in Kudu KuduI R KO KuduI R KO KuduI R KO Query Plan
  23. 23. Introduction Apex Kudu Q&A Kudu Input Operator Partitioning options I want to optimise the second application basing on the number of kudu tablets KuduI R KO KuduI R KO KuduI R KO One to One mapping KuduI R KO KuduI R KO Many to One mapping KuduI R KO Operator config allows for flexible Kudu tablet to Apex operator mapping
  24. 24. Introduction Apex Kudu Q&A Kudu Input Operator Fault tolerance Can I make use of Kudu replication to account for HA of input stream processing KuduI R KO KuduI R KO KuduI R KO
  25. 25. Introduction Apex Kudu Q&A Kudu Input Operator Scan ordering • Simple configuration switch to choose between random order & consistent order • Consistent ordering •Automatically sets Fault tolerance to true • Exactly once processing only possible in Consistent ordering mode • Results in lower throughput Can I tune for throughput or exactly once semantics basing on my requirements Random order scanning KuduI R KO Consistent order scanning KuduI R KO
  26. 26. Introduction Apex Kudu Q&A Kudu Input Operator Control tuples • Apex allows for control tuples ( user defined watermarks ) to be intermixed the data tuples flowing in the DAG • Kudu Input operator currently allows for • Begin Query control tuple • End query control tuple • Control tuples are custom definable • Ex: New query expression in a begin query control tuple • Ex: Window time value at the end of the query processing • Control tuples can be sent either sent at window boundaries or inline • It is inline for Kudu Input operator My model needs a different scoring approach based on the data set time window Control tuple flow T Kudu Input operator T T TEQBQ T T T TB B R operator R1 R1EQBQR2R2
  27. 27. Introduction Apex Kudu Q&A Kudu input operator extensibility Time travel operator •As part of SQL expression allows for setting • Control Tuple END query message • Kudu READ_SNAPSHOT_TIME •Time Travel operator •Each input query can scan the entire table ( with appropriate filters ) for data present at specified READ_SNAPSHOT_TIME time • “SELECT * FROM TABLE where col1 = 234 using options READ_SNAPSHOT_TIME = <3 A.M>” I want to run a nightly model basing on the state of data at hourly boundaries during the daytime
  28. 28. Introduction Apex Kudu Q&A Production References • GE prefix platform processes IOT streaming data for analytics at sub-millisecond time frames • Capitol One • 99.999 % uptime 24x7 • Single digit millisecond end to end latencies • Threatmetrix data pipelines for visualising fraud patterns were processed at single digit millisecond processing latencies • These times exclude the latencies to write to a Cassandra cluster • A leading global financial institution ( non-AUS) • Demonstrate AML compliance • Integrate with Teradata,Vertica and Hadoop
  29. 29. Introduction Apex Kudu Q&A Q & A • Apex Community http://apex.apache.org/ community.html • Docs http://apex.apache.org/docs.html •Powered by Apache Apex http://apex.apache.org/ powered-by-apex.html • REST-API Server  https://github.com/atrato/atrato-server • Twitter handle https://twitter.com/apacheapex • Examples https://github.com/apache/apex-malhar/tree/ master/examples 030201 https://www.linkedin.com/in/ananth-kalyan-chakravarthy-ph-d-7a46156/ @_ananth_g

×