
Apache Flink Training - Deployment & Operations

What you need to know before putting a Flink job into production.



  1. 1. Apache Flink® Training – Deployment & Operations (Flink v1.3.0 – 11.09.2017) 1
  2. 2. What you’ll learn in this session  Capacity planning  Deployment options for Flink  Deployment best practices  Tuning: Work distribution  Tuning: Memory configuration  Tuning: Checkpointing  Tuning: Serialization  Lessons learned 2
  3. 3. This is an interactive session  Please interrupt me at any time if you have a question 3
  4. 4. Capacity Planning 4
  5. 5. First step: do the math!  Think through the resource requirements of your problem • Number of keys, state per key • Number of records, record size • Number of state updates • What are your SLAs? (downtime, latency, max throughput)  What resources do you have? • Network capacity (including Kafka, HDFS, etc) • Disk bandwidth (RocksDB relies on the local disk) • Memory • CPUs 5
  6. 6. Establish a baseline  Normal operation should be able to avoid back pressure  Add a margin for recovery – these resources will be used to “catch up”  Establish your baseline with checkpointing enabled 6
  7. 7. Consider spiky loads  For example, operators downstream from a Window won’t be continuously busy  The downstream parallelism you need therefore depends on how quickly you expect to process these spikes 7
  8. 8. Example: The Setup  Data: • Message size: 2 KB • Throughput: 1,000,000 msg/sec • Distinct keys: 500,000,000 (aggregation in window: 4 longs per key) • Checkpoint every minute 8 Kafka Source keyBy userId Sliding Window 5m size 1m slide Kafka Sink RocksDB
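As a rough sketch, a job with this shape could look as follows in the DataStream API (Flink 1.3). The topic names, the UserStats type (userId plus the four aggregated longs) and its Kafka schema are illustrative placeholders, not part of the original example.

    import java.util.Properties;

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;

    public class WindowedAggregationJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);  // checkpoint every minute, as in the example

            Properties kafkaProps = new Properties();
            kafkaProps.setProperty("bootstrap.servers", "kafka:9092");  // placeholder address

            // Kafka Source (UserStatsSchema is a placeholder (de)serialization schema)
            DataStream<UserStats> events = env.addSource(
                    new FlinkKafkaConsumer010<>("events", new UserStatsSchema(), kafkaProps));

            // keyBy userId -> sliding window (5 min size, 1 min slide) -> 4 long aggregates per key
            DataStream<UserStats> aggregated = events
                    .keyBy("userId")
                    .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1)))
                    .reduce(new ReduceFunction<UserStats>() {
                        @Override
                        public UserStats reduce(UserStats a, UserStats b) {
                            return UserStats.merge(a, b);  // placeholder merge of the 4 long aggregates
                        }
                    });

            // Kafka Sink
            aggregated.addSink(new FlinkKafkaProducer010<>("aggregates", new UserStatsSchema(), kafkaProps));

            env.execute("Sliding window aggregation");
        }
    }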
  9. 9. Example: The setup  Hardware: • 5 machines • 10 gigabit Ethernet • Each machine running a Flink TaskManager • Disks are attached via the network  Kafka is separate 9 TM 1 TM 2 TM 3 TM 4TM 5 NAS Kafka
  10. 10. Example: A machine’s perspective 10 (diagram: TaskManager n running Kafka Source → keyBy → window → Kafka Sink; 10 Gigabit Ethernet, full duplex: in 1250 MB/s, out 1250 MB/s) • Kafka in: 2 KB * 1,000,000 msg/s = 2 GB/s; 2 GB/s / 5 machines = 400 MB/s per machine • Shuffle: 400 MB/s / 5 receivers = 80 MB/s per receiver; 1 receiver is local, 4 are remote: 4 * 80 MB/s = 320 MB/s out (and 320 MB/s in) • Kafka out: 67 MB/s
  11. 11. Excursion 1: Window emit 11 How much data is the window emitting? Recap: 500,000,000 unique users (4 longs per key) Sliding window of 5 minutes, 1 minute slide Assumption: For each user, we emit 2 ints (user_id, window_ts) and 4 longs from the aggregation = 2 * 4 bytes + 4 * 8 bytes = 40 bytes per key Per machine: 500,000,000 users / 5 machines = 100,000,000 users; 100,000,000 * 40 bytes = 4 GB every minute from each machine
  12. 12. Example: A machine’s perspective 12 TaskManager n Kafka Source keyBy window Kafka Sink Kafka: 400 MB/s 10 Gigabit Ethernet (Full Duplex) In: 1250 MB/s 10 Gigabit Ethernet (Full Duplex) Out: 1250 MB/s 2 KB * 1,000,000 = 2GB/s 2GB/s / 5 machines = 400 MB/s Shuffle: 320 MB/s 80MB/s Shuffle: 320 MB/s 400MB/s / 5 receivers = 80MB/s 1 receiver is local, 4 remote: 4 * 80 = 320 MB/s out Kafka: 67 MB/s 4 GB / minute => 67 MB/ second (on average)
  13. 13. Example: Result 13 TaskManager n Kafka Source keyBy window Kafka Sink Kafka: 400 MB/s 10 Gigabit Ethernet (Full Duplex) In: 1250 MB/s 10 Gigabit Ethernet (Full Duplex) Out: 1250 MB/s Shuffle: 320 MB/s 80MB/s Shuffle: 320 MB/s Kafka: 67 MB/s Total In: 720 MB/s Total Out: 387 MB/s
  14. 14. Example: Result 14 TaskManager n Kafka Source keyBy window Kafka Sink Kafka: 400 MB/s 10 Gigabit Ethernet (Full Duplex) In: 1250 MB/s 10 Gigabit Ethernet (Full Duplex) Out: 1250 MB/s Shuffle: 320 MB/s 80 MB/s Shuffle: 320 MB/s Kafka: 67 MB/s Total In: 720 MB/s Total Out: 387 MB/s WRONG. We forgot: • Disk Access to RocksDB • Checkpointing
  15. 15. Example: Intermediate Result 15 TaskManager n Kafka Source keyBy window Kafka Sink Kafka: 400 MB/s 10 Gigabit Ethernet (Full Duplex) In: 1250 MB/s 10 Gigabit Ethernet (Full Duplex) Out: 1250 MB/s Shuffle: 320 MB/s Shuffle: 320 MB/s Kafka: 67 MB/s Disk read: ? Disk write: ?
  16. 16. Excursion 2: Window state access 16 How is the Window operator accessing state? Recap: 1,000,000 msg/sec. Sliding window of 5 minutes, 1 minute slide Assumption: For each user, we store 2 ints (user_id, window_ts) and 4 longs from the aggregation = 2 * 4 bytes + 4 * 8 bytes = 40 bytes per key 5 minute window (window_ts) key (user_id) value (long, long, long, long) Incoming data For each incoming record, update aggregations in 5 windows
  17. 17. Excursion 2: Window state access 17 How is the Window operator accessing state? For each key-value access, we need to retrieve 40 bytes from disk, update the aggregates and put 40 bytes back Per machine (1,000,000 msg/sec / 5 machines = 200,000 msg/sec): 40 bytes * 5 windows * 200,000 msg/sec = 40 MB/s
  18. 18. Example: Intermediate Result 18 TaskManager n Kafka Source keyBy window Kafka Sink Kafka: 400 MB/s 10 Gigabit Ethernet (Full Duplex) In: 1250 MB/s 10 Gigabit Ethernet (Full Duplex) Out: 1250 MB/s Shuffle: 320 MB/s Shuffle: 320 MB/s Kafka: 67 MB/s Total In: 760 MB/s Total Out: 427 MB/s Disk read: 40 MB/s Disk write: 40 MB/s
  19. 19. Excursion 3: Checkpointing 19 How much state are we checkpointing? per machine: 40 bytes * 5 windows * 100,000,000 keys = 20 GB We checkpoint every minute, so 20 GB / 60 seconds = 333 MB/s
  20. 20. Example: Final Result 20 TaskManager n Kafka Source keyBy window Kafka Sink Kafka: 400 MB/s 10 Gigabit Ethernet (Full Duplex) In: 1250 MB/s 10 Gigabit Ethernet (Full Duplex) Out: 1250 MB/s Shuffle: 320 MB/s Shuffle: 320 MB/s Kafka: 67 MB/s Total In: 760 MB/s Total Out: 760 MB/s Disk read: 40 MB/s Disk write: 40 MB/s Checkpoints: 333 MB/s
  21. 21. Example: Network requirements 21 (diagram: TM 1–TM 5, NAS, Kafka) Per TaskManager: In 760 MB/s, Out 760 MB/s NAS: 5 x 80 MB/s = 400 MB/s Kafka: 400 MB/s * 5 + 67 MB/s * 5 = 2335 MB/s Overall network traffic: 2 * 760 * 5 + 400 + 2335 = 10335 MB/s = 82.68 Gigabit/s
  22. 22. Disclaimer!  This was just a “back of the napkin” calculation  Ignored network factors • Protocol overheads (Ethernet, IP, TCP, …) • RPC (Flink‘s own RPC, Kafka, checkpoint store) • Checkpointing causes network bursts • A window emission causes bursts • Remote disk access is not accounted for on the 10 GigE on AWS • Other systems using the network  CPU, memory, disk access speed have all been ignored 22
  23. 23. (High Availability) Deployment 23
  24. 24. Flexible Deployment Options  Hadoop YARN integration • Cloudera, Hortonworks, MapR, … • Amazon Elastic MapReduce (EMR), Google Cloud Dataproc  Mesos & DC/OS integration  Standalone Cluster (“native”) • provided bash scripts • provided Docker images 24 Flink in containerland DAY 3 / 3:20 PM - 4:00 PM MASCHINENHAUS
  25. 25. Flexible Deployment Options  Docs and best-practices coming soon for • Kubernetes • Docker Swarm  Check the Flink documentation for details! 25 Flink in containerland DAY 3 / 3:20 PM - 4:00 PM MASCHINENHAUS
  26. 26. 26 High Availability Deployments
  27. 27. YARN / Mesos HA  Run only one JobManager  Restarts managed by the cluster framework  For HA on YARN, we recommend using at least Hadoop 2.5.0 (due to a critical bug in 2.4)  ZooKeeper is always required 27
  28. 28. Standalone cluster HA  Run standby JobManagers  ZooKeeper manages JobManager failover and restarts  TaskManager failures are resolved by the JobManager  Use a custom tool to ensure that a certain number of JobManagers and TaskManagers is always running 28
  29. 29. Deployment Best Practices Things you should consider before putting a job in production 29
  30. 30. Choose your state backend 30
     • RocksDBStateBackend – working state: local disk (tmp directory); state backup: distributed file system; snapshotting: asynchronous. Good for state larger than available memory. Rule of thumb: 10x slower than memory-based backends.
     • FsStateBackend – working state: JVM heap; state backup: distributed file system; snapshotting: synchronous / async. Fast, requires a large heap.
     • MemoryStateBackend – working state: JVM heap; state backup: JobManager JVM heap; snapshotting: synchronous / async. Good for testing and experimentation with small state (locally).
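For reference, a minimal sketch of selecting a backend programmatically; the checkpoint URI is a placeholder and the RocksDB backend additionally requires the flink-statebackend-rocksdb dependency.

    import java.io.IOException;

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendSetup {
        public static void configure(StreamExecutionEnvironment env) throws IOException {
            // RocksDB: working state on local disk, backups to a distributed file system,
            // snapshots are taken asynchronously. Good for state larger than memory.
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

            // Alternative: FsStateBackend keeps working state on the JVM heap and
            // backs it up to a distributed file system.
            // env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
        }
    }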
  31. 31. Asynchronous Snapshotting 31 (diagram comparing synchronous and asynchronous snapshotting)
  32. 32. Explicitly set max parallelism  Changing this parameter is painful • requires a complete restart and loss of all checkpointed/savepointed state  0 < parallelism <= max parallelism <= 32768  Max parallelism > 128 has some impact on performance and state size 32
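A minimal sketch of setting it explicitly in the job; the values are arbitrary examples.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Fix the maximum parallelism (number of key groups) up front;
    // changing it later means discarding checkpointed/savepointed state.
    env.setMaxParallelism(720);   // example value: parallelism <= maxParallelism <= 32768
    env.setParallelism(5);

    // It can also be set per operator:
    // stream.map(new MyMapper()).setMaxParallelism(720);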
  33. 33. Set UUIDs for all (stateful) operators  Operator UUIDs are needed to restore state from a savepoint  Flink will auto-generate UUIDs, but this results in fragile snapshots.  Setting UUIDs in the API: DataStream<String> stream = env .addSource(new StatefulSource()) .uid("source-id") // ID for the source operator .map(new StatefulMapper()) .uid("mapper-id") // ID for the mapper .print(); 33
  34. 34. Use the savepoint tool for deletions  Savepoint files contain only metadata and depend on the checkpoint files • bin/flink savepoint -d :savepointPath  There is work in progress to make savepoints self-contained  deletion / relocation will be much easier 34
  35. 35. Avoid the deprecated state APIs  Using the Checkpointed interface will prevent you from rescaling your job  Use ListCheckpointed (like Checkpointed, but redistributable) or CheckpointedFunction (full flexibility) instead. Production Readiness Checklist: https://ci.apache.org/projects/flink/flink-docs-release-1.3/ops/production_ready.html 35
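A minimal sketch of operator state with the non-deprecated ListCheckpointed interface; the counting logic is only illustrative.

    import java.util.Collections;
    import java.util.List;

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
    import org.apache.flink.util.Collector;

    // Counts events per parallel instance; because the state is snapshotted as a list
    // (here: one element per subtask), it can be redistributed when the job is rescaled.
    public class EventCounter extends RichFlatMapFunction<String, Long>
            implements ListCheckpointed<Long> {

        private long count;

        @Override
        public void flatMap(String value, Collector<Long> out) {
            count++;
            out.collect(count);
        }

        @Override
        public List<Long> snapshotState(long checkpointId, long timestamp) {
            return Collections.singletonList(count);
        }

        @Override
        public void restoreState(List<Long> state) {
            for (Long c : state) {
                count += c;   // merge list elements after rescaling
            }
        }
    }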
  36. 36. Tuning: CPU usage / work distribution 36
  37. 37. Configure parallelism / slots  These settings influence how the work is spread across the available CPUs  1 CPU per slot is common  multiple CPUs per slot makes sense if one slot (i.e. one parallel instance of the job) performs many CPU intensive operations 37
  38. 38. Operator chaining 38
  39. 39. Task slots 39
  40. 40. Slot sharing (parallelism now 6) 40
  41. 41. What can you do?  Number of TaskManagers vs number of slots per TM  Set slots per TaskManager  Set parallelism per operator  Control operator chaining behavior  Set slot sharing groups to break operators into different slots 42
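A rough sketch of these knobs in the DataStream API; stream, Event, ParseEvent and HeavyComputation are placeholders, and slots per TaskManager are configured separately via taskmanager.numberOfTaskSlots in flink-conf.yaml.

    DataStream<Event> parsed = stream
            .map(new ParseEvent())
            .setParallelism(10)             // parallelism per operator
            .startNewChain();               // control chaining (alternatively: .disableChaining())

    parsed
            .keyBy("userId")
            .map(new HeavyComputation())
            .slotSharingGroup("cpu-heavy")  // break this operator (and what follows) into separate slots
            .setParallelism(4);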
  42. 42. Tuning: Memory Configuration 44
  43. 43. Memory in Flink (on YARN) 45 (diagram of the memory layout: the YARN container limit bounds the JVM process size; inside it, the JVM heap (limited by the Xmx parameter) holds the Memory Manager, internal Flink services and user code (window contents, …); outside the heap are other JVM allocations: classes, metadata, DirectByteBuffers, Netty, network buffers and possibly RocksDB)
  44. 44. Example: Memory in Flink 46 (diagram, for a TaskManager requested with 2000 MB on YARN) • YARN container limit: 2000 MB (container request size) • JVM heap: Xmx = 1500 MB = 2000 * 0.75, default cutoff is 25% (“containerized.heap-cutoff-ratio”) • Other JVM allocations: classes, metadata, stacks, … • JVM process size: < 2000 MB • Netty: ~64 MB • RocksDB: sized via the RocksDB config • MemoryManager: up to 70% of the available heap (“taskmanager.memory.fraction”) • Network buffers: “taskmanager.network.memory.min” (64 MB) and “.max” (1 GB)
  45. 45. RocksDB  If you have plenty of memory, be generous with RocksDB (note: RocksDB does not allocate its memory from the JVM’s heap!). When allocating more memory for RocksDB on YARN, increase the memory cutoff (= smaller heap)  RocksDB has many tuning parameters.  Flink offers predefined collections of options: • SPINNING_DISK_OPTIMIZED_HIGH_MEM • FLASH_SSD_OPTIMIZED 47
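As a sketch, the predefined option profiles can be set on the backend directly; the checkpoint URI is a placeholder, env is the StreamExecutionEnvironment of the surrounding job, and the constructor may throw IOException.

    import org.apache.flink.contrib.streaming.state.PredefinedOptions;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

    RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
    // Pick one of the predefined RocksDB option profiles instead of tuning individual parameters.
    backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
    env.setStateBackend(backend);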
  46. 46. Tuning: Checkpointing 48
  47. 47. Checkpointing  Measure, analyze and try out!  Configure a checkpointing interval • How much can you afford to reprocess on restore? • How many resources are consumed by the checkpointing? (cost in throughput and latency)  Fine-tuning • “min pause between checkpoints” • “checkpoint timeout” • “concurrent checkpoints”  Configure exactly once / at least once • exactly once requires checkpoint barrier alignment, which may buffer/spill records (can affect latency) 49
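A sketch of these settings in the API; the interval and values are illustrative examples, not recommendations.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // interval: 1 minute

    CheckpointConfig cfg = env.getCheckpointConfig();
    cfg.setMinPauseBetweenCheckpoints(30_000);  // "min pause between checkpoints"
    cfg.setCheckpointTimeout(600_000);          // "checkpoint timeout"
    cfg.setMaxConcurrentCheckpoints(1);         // "concurrent checkpoints"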
  48. 48. (image-only slide) 50
  49. 49. Tuning: Serialization 51
  50. 50. (de)serialization is expensive  Getting this wrong can have a huge impact  But don’t overthink it 52
  51. 51. Serialization in Flink  Flink has its own serialization framework, which is used for • Basic types (Java primitives and their boxed form) • Primitive arrays and Object arrays • Tuples • Scala case classes • POJOs  Otherwise Flink falls back to Kryo 53
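For illustration, a hypothetical event type that falls into Flink's POJO category (public class, public no-argument constructor, public fields or getters/setters), so it is handled by Flink's own serializers rather than the Kryo fallback; the field names are made up.

    // A Flink-friendly POJO: avoids the Kryo fallback.
    public class UserEvent {
        public long userId;
        public long timestamp;
        public String page;

        public UserEvent() {}   // no-arg constructor required for POJO serialization
    }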
  52. 52. A note on custom serializers / parsers  Avoid obvious anti-patterns, e.g. creating a new JSON parser for every record  Many sources (e.g. Kafka) can parse JSON directly  If possible, avoid shipping the schema with every record 54 (diagram: two pipelines, Source → map() → keyBy()/window()/apply() → Sink, one of which forwards raw Strings between operators)
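One way to avoid the per-record parser anti-pattern is to create the parser once per parallel instance in open() of a rich function; this sketch uses Jackson as an example and the class name is a placeholder (parsing directly in the source's DeserializationSchema is even better where the source supports it).

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Re-uses one ObjectMapper per parallel instance instead of creating a parser per record.
    public class ParseJson extends RichMapFunction<String, JsonNode> {

        private transient ObjectMapper mapper;

        @Override
        public void open(Configuration parameters) {
            mapper = new ObjectMapper();   // created once per subtask, not per record
        }

        @Override
        public JsonNode map(String value) throws Exception {
            return mapper.readTree(value);
        }
    }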
  53. 53. What else?  You should register types with Kryo, e.g., • env.registerTypeWithKryoSerializer(DateTime.class, JodaDateTimeSerializer.class)  You should register any subtypes; this can increase performance a lot  You can use serializers from other systems, like Protobuf or Thrift, with Kryo by registering the types (and serializers)  Avoid expensive types, e.g. Collections, large records  Do not change serializers or type registrations if you are restoring from a savepoint 55
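A sketch of those registrations; the JodaDateTimeSerializer import assumes the kryo-serializers library is on the classpath, and MyEventSubtype is a placeholder.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.joda.time.DateTime;
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Register a custom Kryo serializer for a third-party type (the example from the slide).
    env.registerTypeWithKryoSerializer(DateTime.class, JodaDateTimeSerializer.class);

    // Register concrete subtypes up front so Kryo does not have to write full class names.
    env.registerType(MyEventSubtype.class);   // MyEventSubtype is a placeholder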
  54. 54. Conclusion 56
  55. 55. Tuning Approaches  1. Develop / optimize job locally • Use data generator / small sample dataset • Check the logs for warnings • Check the UI for backpressure, throughput, metrics • Debug / profile locally  2. Optimize on cluster • Checkpointing, parallelism, slots, RocksDB, network config, … 57
  56. 56. The usual suspects  Inefficient serialization  Inefficient dataflow graph • Too many repartitionings; blocking I/O  Slow external systems  Slow network, slow disks  Checkpointing configuration 58
  57. 57. Q & A Let’s discuss … 59
  58. 58. Thank you! @rmetzger | @theplucas @dataArtisans 60
  59. 59. We are hiring! data-artisans.com/careers 61
  60. 60. Deployment: Security Bonus Slides 62
  61. 61. Outline 1. Hadoop delegation tokens 2. Kerberos authentication 3. SSL 63
  62. 62. Hadoop delegation tokens 64  Quite limited • YARN only • Hadoop services only • Tokens expire (diagram: a job’s tasks access HDFS, Kafka and ZooKeeper using a delegation token; the WebUI and CLI connect via HTTP and Akka)
  63. 63. Kerberos authentication 65  Keytab-based identity  Standalone, YARN, Mesos  Shared by all jobs (diagram: a job’s tasks access HDFS, Kafka and ZooKeeper using a keytab; the WebUI and CLI connect via HTTP and Akka)
  64. 64. SSL  taskmanager.data.ssl.enabled: communication between task managers  blob.service.ssl.enabled: client/server blob service  akka.ssl.enabled: akka-based control connection between the flink client, jobmanager and taskmanager  jobmanager.web.ssl.enabled: https for WebUI 66 (diagram: the same setup as before, with certificates in addition to the keytab and HTTPS for the WebUI)
  65. 65. Limitations  The clients are not authenticated to the cluster  All the secrets known to a Flink job are exposed to everyone who can connect to the cluster's endpoint  Exploring SSL mutual authentication 67
  66. 66. Network buffers 68
  67. 67. Configure network buffers  TaskManagers exchange data via permanent TCP connections  Each TM needs enough buffers to concurrently serve all outgoing and incoming connections  Configuration parameter: “taskmanager.network.numberOfBuffers”  As few as possible, maybe 2x the minimum • Avoid having too much data in buffers: delayed checkpointing / barrier alignment spilling 69
  68. 68. What are these buffers needed for? 70 (diagram: a small Flink cluster with 4 processing slots on 2 TaskManagers, and a simple job: Map → Keyed Window)
  69. 69. What are these buffers needed for? 71 (diagram: the job with a parallelism of 4 and 2 processing slots per machine; each TaskManager needs 8 network buffers for outgoing data and 8 for incoming data between the Map and Window subtasks)
  70. 70. What are these buffers needed for? 72 (diagram: the same job, parallelism 4, 2 processing slots per machine)
