
Smart Partitioning with Apache Apex (Webinar)


Processing big data often requires running the same computation in parallel across multiple processes or threads, called partitions, with each partition handling a subset of the data. This becomes all the more necessary when processing live data streams, where meeting SLAs is paramount. Furthermore, an application is made up of multiple different computations, each of which may have different partitioning needs. Partitioning also needs to adapt to changing data rates, input sources, and other application requirements such as SLAs.

In this talk, we will introduce how Apache Apex, a distributed stream processing platform on Hadoop, handles partitioning. We will look at the different partitioning schemes provided by Apex, some of which are unique in this space. We will also look, with examples, at how Apex does dynamic partitioning, a feature pioneered by Apex to handle varying data needs. Finally, we will cover the utilities and libraries that Apex provides for users to implement their own custom partitioning.



  1. Smart Partitioning with Apache Apex. Pramod Immaneni, Architect, PMC member; Thomas Weise, Architect & Co-founder, PMC member. May 19th, 2016
  2. Stream Processing
     • Data from a variety of sources (IoT, Kafka, files, social media, etc.)
     • Unbounded stream data
       ᵒ Batch can be processed as a stream (but a stream is not a batch)
     • (In-memory) processing with temporal boundaries (windows)
     • Stateful operations: aggregation, rules, … -> analytics
     • Results stored to a variety of sinks or destinations
       ᵒ A streaming application can also serve data with very low latency
     [Diagram: Browser -> Web Server -> Kafka Input (logs) -> Decompress, Parse, Filter -> Dimensions Aggregate -> Kafka]
  3. Apache Apex Features
     • In-memory stream processing platform
       ᵒ Developed since 2012, ASF TLP since 04/2016
     • Unobtrusive Java API to express (custom) logic
     • Scale-out, distributed, parallel
     • High-throughput & low-latency processing
     • Windowing (temporal boundaries)
     • Reliability, fault tolerance, operability
     • Hadoop native
     • Compute locality, affinity
     • Dynamic updates, elasticity
  4. Big Data Processing & Partitioning
     • Large amounts of data to process
     • Data could be streaming in at high velocity
     • Pipelining and partitioning to solve the problem
     • Partitioning
       ᵒ Run the same logic in multiple processes or threads
       ᵒ Each partition processes a subset of the data
     • Apex supports partitioning out of the box
       ᵒ Different partitioning schemes
       ᵒ Unification
       ᵒ Static & dynamic partitioning
       ᵒ Separation of processing logic from scaling decisions
  5. Apex Platform Overview
  6. Native Hadoop Integration
     • YARN is the resource manager
     • HDFS is used for storing any persistent state
  7. Application Development Model
     • A Stream is a sequence of data tuples
     • A typical Operator takes one or more input streams, performs computations & emits one or more output streams
       ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
       ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded
     • A Directed Acyclic Graph (DAG) is made up of operators and streams
     [Diagram: operators connected by streams of tuples, forming a DAG]
  8. Streaming Windows
     • Application window
     • Sliding window and tumbling window
     • Checkpoint window
     • No artificial latency
  9. Partitioning
     [Diagrams: logical DAG 0 -> 1 -> 2 -> 3; physical DAG with operator 1 split into 3 partitions (1a, 1b, 1c) and operator 2 into 2 partitions (2a, 2b), connected N x M with unifiers; one physical plan with no bottleneck, and one with a bottleneck on the intermediate unifier]
  10. Advanced Partitioning
      [Diagrams: parallel partitions, where downstream operators are partitioned in lockstep with upstream ones; cascading unifiers, with execution plans for N = 4, M = 1 and for N = 4, M = 1, K = 2, showing operators, unifiers, containers, and NICs]
  11. Dynamic Partitioning
      • Partitioning changes while the application is running
        ᵒ Change the number of partitions at runtime based on stats
        ᵒ Determine the initial number of partitions dynamically
      • Kafka operators scale according to the number of Kafka partitions
        ᵒ Supports redistribution of state when the number of partitions changes
        ᵒ API for custom scaler or partitioner
      [Diagram: partitions 1a/1b and 2a–2d splitting and merging as the plan is rescaled; unifiers not shown]
  12. Kafka Consumer Partitioning
      [Diagrams: 1-to-1 partitioning and 1-to-N partitioning of Kafka partitions to consumer operator instances]
  13. File Reader Partitioning
      • User can specify the number of partitions to start with
      • Partitions are created accordingly
      • Files are distributed among the partitions
      • User can change the number of partitions at runtime by setting a property
        ᵒ New partitions are created automatically
        ᵒ Remaining files are rebalanced among the new partition set
      [Diagram: File1, File2, File3, File4, File5 … Filen distributed across reader partitions]
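The file-to-partition distribution described on this slide can be sketched in plain Java. This is a hypothetical illustration of round-robin assignment, not the actual Apex file reader code; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: distribute files across N reader partitions round-robin. When the
// partition count changes at runtime, calling this again on the remaining
// files rebalances them over the new partition set.
public class FileDistributor {
    public static List<List<String>> distribute(List<String> files, int partitions) {
        List<List<String>> assignment = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            // File i goes to partition i mod N.
            assignment.get(i % partitions).add(files.get(i));
        }
        return assignment;
    }
}
```

For example, five files over two partitions yields three files on the first partition and two on the second.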
  14. Block Reader Partitioning
      • Users can specify a minimum and a maximum number of partitions
      • New partitions are added on the fly as there are more blocks to read
      • Partitions are released when there are no pending blocks
      [Diagram: File1, File2, File3, File4, File5 … Filen split into blocks read by a varying number of partitions]
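The min/max scaling rule above can be modeled as a single clamping function. This is a simplified sketch of the policy the slide describes (one partition per pending block, bounded by the user-configured limits), not the actual Apex block reader logic.

```java
// Sketch: grow block-reader partitions while blocks are pending and shrink
// back as the backlog drains, clamped to the configured [min, max] range.
public class BlockReaderScaler {
    public static int desiredPartitions(int pendingBlocks, int min, int max) {
        return Math.max(min, Math.min(max, pendingBlocks));
    }
}
```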
  15. Throughput Partitioner
      • Scale based on thresholds for processed events
      • Demo: location tracking, scale with the number of active devices
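A threshold-based scaling decision like the one on this slide might look as follows. The doubling/halving policy and the per-partition thresholds are assumptions for illustration; the actual throughput partitioner's policy may differ.

```java
// Sketch: if per-partition throughput exceeds an upper threshold, double the
// partition count; if it falls below a lower threshold, halve it; otherwise
// keep the current count.
public class ThroughputScaler {
    public static int rescale(int current, long eventsPerWindow,
                              long maxPerPartition, long minPerPartition) {
        long perPartition = eventsPerWindow / current;
        if (perPartition > maxPerPartition) {
            return current * 2;
        }
        if (perPartition < minPerPartition && current > 1) {
            return Math.max(1, current / 2);
        }
        return current;
    }
}
```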
  16. How Dynamic Partitioning Works
      • Partitioning decision (yes/no) by trigger (StatsListener)
        ᵒ Pluggable component, can use any system or custom metric
        ᵒ Externally driven partitioning example: KafkaInputOperator
      • Stateful!
        ᵒ Uses checkpointed state
        ᵒ Ability to transfer state from old to new partitions (partitioner, customizable)
        ᵒ Steps:
          • Call partitioner
          • Modify the physical plan, rewrite checkpoints as needed
          • Undeploy old partitions from the execution layer
          • Release/request container resources
          • Deploy new partitions (from the rewritten checkpoint)
        ᵒ No loss of data (buffered)
        ᵒ Incremental operation; partitions that don't change continue processing
      • API: Partitioner interface
  17. Writing a Custom Partitioner
      • Partitioner accepts the set of partitions as of the last checkpoint (frozen state) and returns the set of new partitions
      • Access to operator state
      • Control partitioning for each input port
      • Can implement any custom logic to derive new partitions from old ones
      • StatsListener
        ᵒ public Response processStats(BatchedOperatorStats stats)
        ᵒ Throughput, latency, CPU, …
      • Partitioner
        ᵒ public Collection<Partition<T>> definePartitions(Collection<Partition<T>> partitions, PartitioningContext context)
      • Examples: malhar/tree/master/library/src/main/java/com/datatorrent/lib/partitioner
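The definePartitions contract above (take the frozen set of partitions, return a new set) can be sketched with a local stand-in type. The Partition class below is a simplified placeholder, not com.datatorrent.api.Partitioner.Partition, and the doubling policy is only an example of "deriving new partitions from old ones".

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Sketch of the definePartitions contract: given the partitions as of the
// last checkpoint, return the new set. Here each old partition is split in
// two, with each child inheriting the parent's checkpointed state.
public class DoublingPartitioner {
    public static class Partition {
        public final int state; // stands in for checkpointed operator state
        public Partition(int state) { this.state = state; }
    }

    public static List<Partition> definePartitions(Collection<Partition> old) {
        List<Partition> result = new ArrayList<>();
        for (Partition p : old) {
            result.add(new Partition(p.state));
            result.add(new Partition(p.state));
        }
        return result;
    }
}
```

A real Apex partitioner would also reassign partition keys and could redistribute, rather than copy, the state between the new partitions.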
  18. How Tuples Are Split Between Partitions
      • Tuple hashcode and mask used to determine the destination partition
        ᵒ Mask picks the last n bits of the hashcode of the tuple
        ᵒ The hashCode method can be overridden
      • StreamCodec can be used to specify a custom hashcode for tuples
        ᵒ Can also be used for specifying custom serialization
      • Example: tuple { Name, 24204842, San Jose }, hashcode …0101, mask 0x11: masked value 00 -> partition 1, 01 -> partition 2, 10 -> partition 3, 11 -> partition 4
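The mask-and-hashcode routing on this slide reduces to a single bitwise AND in plain Java. This is a minimal sketch of the idea, not the platform's internal implementation:

```java
// Sketch: the mask keeps the low bits of the tuple's hashcode, and the
// masked value selects the destination partition. A 2-bit mask (0b11)
// yields 4 possible destinations.
public class MaskRouter {
    public static int destination(Object tuple, int mask) {
        return tuple.hashCode() & mask;
    }
}
```

For an Integer tuple, whose hashCode is its own value, a mask of 0b11 routes 4 to destination 0, 5 to 1, and 7 to 3.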
  19. Custom Splits
      • Custom distribution of tuples
        ᵒ E.g. broadcast
      • Example: tuple { Name, 24204842, San Jose }, hashcode …0101, mask 0x00: masked value 00 maps to partitions 1, 2, 3 and 4 (every tuple reaches every partition)
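Broadcast can be seen as a degenerate case of the mask scheme: with mask 0x00 every tuple hashes to key 0, and if every partition registers key 0, each tuple is delivered to all of them. The routing table below is a local illustration of that idea, not Apex's StreamCodec API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: route a tuple to every partition whose registered key matches the
// tuple's masked hashcode. With mask 0 and all keys 0, this is a broadcast.
public class BroadcastRouter {
    public static List<Integer> destinations(Object tuple, int mask,
                                             List<Integer> partitionKeys) {
        int key = tuple.hashCode() & mask;
        List<Integer> dests = new ArrayList<>();
        for (int p = 0; p < partitionKeys.size(); p++) {
            if (partitionKeys.get(p) == key) {
                dests.add(p);
            }
        }
        return dests;
    }
}
```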
  20. Resources
      • Learn more:
      • Subscribe -
      • Download -
      • Follow @ApacheApex -
      • Meetups -
      • More examples:
      • Slideshare:
      • Free Enterprise License for Startups -
  21. Q&A