A Comparative Performance Evaluation of Apache Flink

I compare Apache Flink to Apache Spark, Apache Tez, and MapReduce in Apache Hadoop in terms of performance. I run experiments using two benchmarks, Terasort and Hashjoin.

1. A Comparative Performance Evaluation of Flink
Dongwon Kim, POSTECH
2. About Me
• Postdoctoral researcher @ POSTECH
• Research interest
  • Design and implementation of distributed systems
  • Performance optimization of big data processing engines
• Doctoral thesis
  • MR2: Fault Tolerant MapReduce with the Push Model
• Personal blog
  • http://eastcirclek.blogspot.kr
• Why I'm here
3. Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
4. TeraSort
• Hadoop MapReduce program for the annual terabyte sort competition
• TeraSort is essentially distributed sort (DS)
• [Diagram: typical DS phases across two nodes (read → local sort → shuffling → local sort → write), producing a total order over the output partitions]
5. TeraSort for MapReduce
• Included in Hadoop distributions
  • with TeraGen & TeraValidate
• Identity map & reduce functions
• Range partitioner built on sampling
  • To guarantee a total order & to prevent partition skew
  • Sampling computes the boundary points within a few seconds
• [Diagram: map task (read → sort → map) and reduce task (shuffling → sort → reduce → write) mapped onto the DS phases; the record range is split into partitions 1 … r by the boundary points]
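The sampling idea can be sketched as follows. This is a minimal illustration with assumed String keys and an inclusive upper-boundary rule, not Hadoop's actual TotalOrderPartitioner:

```scala
import scala.util.Random

// Sketch of a sampling-based range partitioner (String keys assumed).
// Partition i holds keys up to boundaries(i); keys above the last boundary
// go to the last partition, which preserves a total order across partitions.
class SampledRangePartitioner(boundaries: Array[String]) {
  def getPartition(key: String): Int = {
    val i = boundaries.indexWhere(key <= _)
    if (i >= 0) i else boundaries.length
  }
}

object SampledRangePartitioner {
  // Derive r - 1 boundary points from a small random sample of the keys.
  def fromSample(sampledKeys: Seq[String], numPartitions: Int): SampledRangePartitioner = {
    val sorted = sampledKeys.sorted
    val boundaries = (1 until numPartitions)
      .map(i => sorted(math.min(i * sorted.size / numPartitions, sorted.size - 1)))
      .toArray
    new SampledRangePartitioner(boundaries)
  }

  def main(args: Array[String]): Unit = {
    // Tiny usage example: 4 partitions from 12 sampled keys.
    val sample = Random.shuffle(('a' to 'z').map(_.toString)).take(12)
    val partitioner = fromSample(sample, numPartitions = 4)
    println(partitioner.getPartition("m"))
  }
}
```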
6. TeraSort for Tez
• Tez can execute TeraSort for MapReduce w/o any modification
  • mapreduce.framework.name = yarn-tez
• Tez DAG plan of TeraSort for MapReduce
  • [Diagram: an initialmap vertex (map task: read → sort → map) feeding a finalreduce vertex (reduce task: shuffling → sort → reduce → write), from input data to output data]
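For reference, the switch amounts to a single property; a sketch using the Hadoop Configuration API (the rest of the job setup is elided):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Run the unmodified MapReduce TeraSort on the Tez runtime.
val conf = new Configuration()
conf.set("mapreduce.framework.name", "yarn-tez") // "yarn" would run it on MapReduce
val job = Job.getInstance(conf, "TeraSort on Tez")
// ... the usual TeraSort job setup (input/output formats, partitioner) stays unchanged ...
```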
7. TeraSort for Spark & Flink
• My source code on GitHub:
  • https://github.com/eastcirclek/terasort
• Sampling-based range partitioner from TeraSort for MapReduce
• Visit my personal blog for a detailed explanation
  • http://eastcirclek.blogspot.kr
8. TeraSort for Spark
• Code: two RDDs
  • RDD1 (Stage 0, shuffle-map tasks, via newAPIHadoopFile): a new RDD that reads from HDFS; # partitions = # blocks; read → sort
  • RDD2 (Stage 1, result tasks, via repartitionAndSortWithinPartitions): repartitions the parent RDD with the user-specified partitioner; shuffling → sort → write output to HDFS
• [Diagram: the two stages mapped onto the DS phases read → local sort → shuffling → local sort → write]
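A minimal Scala sketch of this two-RDD structure, not the repository code: it assumes TeraInputFormat from the Hadoop examples jar is on the classpath and substitutes Spark's built-in sampling RangePartitioner and a plain text sink for the repository's own partitioner and output format.

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object SparkTeraSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TeraSort sketch"))

    // Stage 0 (shuffle-map tasks): one partition per HDFS block.
    val input = sc
      .newAPIHadoopFile[Text, Text, org.apache.hadoop.examples.terasort.TeraInputFormat](args(0))
      .map { case (k, v) => (k.toString, v.toString) } // String keys keep the sketch serializable

    // Stage 1 (result tasks): repartition with a sampling-based range partitioner,
    // sort within each partition, and write the result to HDFS.
    val numPartitions = args(2).toInt
    input
      .repartitionAndSortWithinPartitions(new RangePartitioner(numPartitions, input))
      .saveAsTextFile(args(1))
  }
}
```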
9. TeraSort for Flink
• Code: a pipeline consisting of four operators
  • DataSource: create a dataset that reads tuples from HDFS (read)
  • Partition: partition the tuples (shuffling)
  • SortPartition: sort the tuples of each partition (local sort)
  • DataSink: write output to HDFS (write)
• No map-side sorting, due to pipelined execution
• [Diagram: the four operators mapped onto the DS phases read → shuffling → local sort → write]
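A minimal Scala sketch of the four-operator pipeline, not the repository code: the boundary points and the line-based input are placeholders, whereas the real job reads 100-byte records through a Hadoop input format and samples the boundaries.

```scala
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._

// Range partitioner over sampled boundary points (placeholder boundaries below).
class BoundaryRangePartitioner(boundaries: Array[String]) extends Partitioner[String] {
  override def partition(key: String, numPartitions: Int): Int = {
    val i = boundaries.indexWhere(key <= _)
    if (i >= 0) i else numPartitions - 1
  }
}

object FlinkTeraSortSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // DataSource: read records and split them into (key, value) tuples.
    val records: DataSet[(String, String)] =
      env.readTextFile(args(0)).map(line => (line.take(10), line.drop(10)))

    val boundaries = Array("F", "Q") // placeholder for sampled boundary points

    records
      .partitionCustom(new BoundaryRangePartitioner(boundaries), 0) // Partition
      .sortPartition(0, Order.ASCENDING)                            // SortPartition
      .writeAsCsv(args(1))                                          // DataSink
    env.execute("TeraSort sketch")
  }
}
```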
10. Importance of TeraSort
• Suitable for measuring the pure performance of big data engines
  • No data transformation (like map or filter) with user-defined logic
  • The basic facilities of each engine are used
• "Winning the sort benchmark" is a great means of PR
11. Outline
• TeraSort for various engines
• Experimental setup
  • Machine specification
  • Node configuration
• Results & analysis
• What else for better performance?
• Conclusion
12. Machine specification (42 identical machines)
• DELL PowerEdge R610
  • CPU: two Intel Xeon X5650 processors (12 cores in total)
  • Memory: 24 GB
  • Disk: 6 disks * 500 GB/disk
  • Network: 10 Gigabit Ethernet
• Comparison with the Spark team's machines (results can differ on newer machines):

  |           | My machine                  | Spark team                    |
  | Processor | Intel Xeon X5650 (Q1, 2010) | Intel Xeon E5-2670 (Q1, 2012) |
  | Cores     | 6 * 2 processors            | 8 * 4 processors              |
  | Memory    | 24 GB                       | 244 GB                        |
  | Disks     | 6 HDDs                      | 8 SSDs                        |
13. Node configuration (24 GB on each node)
• 2 GB on every node for daemons: NodeManager (1 GB) + DataNode (1 GB)
• MapReduce-2.7.1 & Tez-0.7.0: 13 GB for 1 GB MapTasks/ReduceTasks, with the ShuffleService inside the NodeManager
• Spark-1.5.1: one Executor (12 GB, with its internal memory layout and various managers) containing a thread pool of 12 task slots, plus a Driver (1 GB)
• Flink-0.9.1: one TaskManager (12 GB, with its internal memory layout and various managers) containing 12 task slots run by task threads, plus a JobManager (1 GB)
• At most 12 simultaneous tasks per node for every engine
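For illustration only, the per-node budget above might be expressed with settings along these lines; the sizes come from the slide, but the choice of configuration keys is my assumption and the author's actual launch scripts may differ:

```scala
import org.apache.spark.SparkConf

// Spark: one 12 GB executor per node with 12 task slots, plus a 1 GB driver.
val sparkConf = new SparkConf()
  .set("spark.executor.memory", "12g")
  .set("spark.executor.cores", "12")
  .set("spark.driver.memory", "1g")

// Flink equivalents would live in flink-conf.yaml (shown here as comments):
//   taskmanager.heap.mb: 12288         -> one 12 GB TaskManager per node
//   taskmanager.numberOfTaskSlots: 12  -> 12 task slots
//   jobmanager.heap.mb: 1024           -> 1 GB JobManager
```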
14. Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
  • Flink is faster than other engines due to its pipelined execution
• What else for better performance?
• Conclusion
15. How to read a swimlane graph & throughput graphs
• Swimlane graph: tasks (y-axis) over time since job start in seconds (x-axis)
  • Each line is the duration of one task; different patterns mark different stages
  • Example: 6 waves of 1st-stage tasks and 1 wave of 2nd-stage tasks; the two stages are hardly overlapped
• Throughput graphs: cluster network throughput (in/out) and cluster disk throughput (read/write)
  • Example: no network traffic during the 1st stage
16. Result of sorting 80 GB/node (3.2 TB)
• Flink is the fastest due to its pipelined execution
• Tez and Spark do not overlap the 1st and 2nd stages
• MapReduce is slow despite overlapping stages

  | Engine                    | Time (sec) |
  | MapReduce in Hadoop-2.7.1 | 2157       |
  | Tez-0.7.0                 | 1887       |
  | Spark-1.5.1               | 2171       |
  | Flink-0.9.1               | 1480       |

• [Swimlane graphs: MapReduce, Tez, and Spark each run a 1st and a 2nd stage, while Flink runs one pipeline of 1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink]
• * Map output compression turned on for Spark and Tez
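The compression footnote above corresponds to settings along these lines (a sketch; the codec choice is an assumption, and Tez exposes analogous runtime keys, tez.runtime.compress / tez.runtime.compress.codec):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// MapReduce: compress intermediate map output.
val hadoopConf = new Configuration()
hadoopConf.setBoolean("mapreduce.map.output.compress", true)
hadoopConf.set("mapreduce.map.output.compress.codec",
  "org.apache.hadoop.io.compress.SnappyCodec") // codec choice is an assumption

// Spark: shuffle output compression (on by default in 1.5).
val sparkConf = new SparkConf().set("spark.shuffle.compress", "true")
```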
17. Tez and Spark do not overlap 1st and 2nd stages
• Tez & Spark throughput graphs: the network is idle during the 1st stage; (1) the 2nd stage starts, (2) the output of the 1st stage is sent, and (3) the disk write to HDFS occurs only after shuffling is done
• Flink (1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink): (1) network traffic occurs from the start, and (2) the write to HDFS occurs right after shuffling is done
• [Cluster network (in/out) and disk (read/write) throughput graphs for each engine]
18. Tez does not overlap 1st and 2nd stages
• Tez has parameters to control the degree of overlap
  • tez.shuffle-vertex-manager.min-src-fraction : 0.2
  • tez.shuffle-vertex-manager.max-src-fraction : 0.4
• However, the 2nd stage is scheduled early but launched late
• [Swimlane graph annotated with the scheduled vs. launched times of 2nd-stage tasks]
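Those two knobs are ordinary configuration properties; a sketch with the values quoted above:

```scala
import org.apache.hadoop.conf.Configuration

// Start scheduling the downstream (shuffle) vertex after 20% of the upstream tasks
// finish, and have all of its tasks scheduled once 40% have finished.
val tezConf = new Configuration()
tezConf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.2f)
tezConf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.4f)
```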
19. Spark does not overlap 1st and 2nd stages
• Spark cannot execute multiple stages simultaneously
• Also mentioned in the following VLDB paper (2015):
  "Spark doesn't support the overlap between shuffle write and read stages. … Spark may want to support this overlap in the future to improve performance."
• Experimental results of that paper:
  • Spark is faster than MapReduce for WordCount, K-means, and PageRank
  • MapReduce is faster than Spark for Sort
20. MapReduce is slow despite overlapping stages
• mapreduce.job.reduce.slowstart.completedMaps : [0.0, 1.0]
  • 0.05 (overlapping, default): 2157 sec
  • 0.95 (no overlapping): 2385 sec
  • Overlapping brings only a ~10% improvement
• Wang's attempt to overlap Spark stages
  • Wang proposes to overlap stages to achieve better utilization
• Why do Spark & MapReduce improve by just 10%?
• [Swimlane graphs of the 1st and 2nd stages for the two slowstart settings]
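The slow-start knob is set per job; a sketch showing the two settings compared above:

```scala
import org.apache.hadoop.conf.Configuration

// Fraction of map tasks that must complete before reduce tasks are scheduled.
val mrConf = new Configuration()
mrConf.setFloat("mapreduce.job.reduce.slowstart.completedMaps", 0.05f) // overlapping (default)
// mrConf.setFloat("mapreduce.job.reduce.slowstart.completedMaps", 0.95f) // almost no overlap
```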
21. Data transfer between tasks of different stages
• Traditional pull model
  • Used in MapReduce, Spark, and Tez
  • The producer task (1) writes its output file (partitions P1 … Pn) to disk; each consumer task (2) requests its partition from the shuffle server, which (3) sends it
  • Extra disk accesses & simultaneous disk accesses
  • Shuffling affects the performance of producers, which is why overlapping stages leads to only a ~10% improvement
• Pipelined data transfer
  • Used in Flink
  • Data is transferred from memory to memory
  • Flink causes fewer disk accesses during shuffling
22. Flink causes fewer disk accesses during shuffling

  |                       | MapReduce | Flink | diff. |
  | Total disk write (TB) | 9.9       | 6.5   | 3.4   |
  | Total disk read (TB)  | 8.1       | 6.9   | 1.2   |

• The difference comes from shuffling
• Shuffled data are sometimes read from the page cache, hence the smaller difference in reads
• [Cluster disk throughput graphs for MapReduce and Flink; the total amount of disk read/write equals the area under the read/write curves]
23. Result of TeraSort with various data sizes

  | Node data size (GB) | Flink | Spark | MapReduce | Tez  |
  | 10                  | 157   | 387   | 259       | 277  |
  | 20                  | 350   | 652   | 555       | 729  |
  | 40                  | 741   | 1135  | 1085      | 1709 |
  | 80                  | 1480  | 2171  | 2157      | 1887 |
  | 160                 | 3127  | 4927  | 4796      | 3950 |
  (times in seconds)

• [Log-scale plot of time vs. node data size; the 80 GB/node row is what we've seen so far]
• * Map output compression turned on for Spark and Tez
24. Result of HashJoin
• 10 slave nodes
• Datasets generated with org.apache.tez.examples.JoinDataGen
  • Small dataset: 256 MB
  • Large dataset: 240 GB (24 GB/node)
• Result: Flink is ~2x faster than Tez and ~4x faster than Spark

  | Engine      | Time (sec) |
  | Tez-0.7.0   | 770        |
  | Spark-1.5.1 | 1538       |
  | Flink-0.9.1 | 378        |

• Visit my blog for details
• * No map output compression for Spark and Tez, unlike in TeraSort
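On the Flink side, a broadcast hash join of a small dataset against a large one can be written as below; this is a minimal sketch over comma-separated key/value files, not the exact benchmark code:

```scala
import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
import org.apache.flink.api.scala._

object HashJoinSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // args(0): small dataset, args(1): large dataset, both as "key,value" lines.
    val small = env.readCsvFile[(String, String)](args(0))
    val large = env.readCsvFile[(String, String)](args(1))

    // Hint Flink to broadcast the small side and build the hash table from it.
    val joined = large
      .join(small, JoinHint.BROADCAST_HASH_SECOND)
      .where(0).equalTo(0) { (l, r) => (l._1, l._2, r._2) } // (key, large value, small value)

    joined.writeAsCsv(args(2))
    env.execute("HashJoin sketch")
  }
}
```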
25. Result of HashJoin with swimlane & throughput graphs
• [Swimlane & throughput graphs: Tez and Spark show idle periods between stages, while Flink's operators (1 DataSource, 2 DataSource, 3 Join, 4 DataSink) overlap, in particular the 2nd and 3rd; the graphs are annotated with total disk read/write volumes between 0.24 TB and 0.84 TB]
26. Flink's shortcomings
• No support for map output compression
  • Small data blocks are pipelined between operators
• Job-level fault tolerance only
  • Shuffle data are not materialized
• Low disk throughput during the post-shuffling phase
27. Low disk throughput during the post-shuffling phase
• Possible reason: sorting records from small files
  • Concurrent disk access to small files → too many disk seeks → low disk throughput
  • Other engines merge records from larger files than Flink
• "Eager pipelining moves some of the sorting work from the mapper to the reducer"
  • from MapReduce Online (NSDI 2010)
• [Disk throughput graphs during the post-shuffling phase for Flink, Tez, and MapReduce]
28. Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
29. MR2 – another MapReduce engine
• PhD thesis: MR2: Fault Tolerant MapReduce with the Push Model
  • Developed for 3 years
• Provides the user interface of Hadoop MapReduce
  • No DAG support
  • No in-memory computation
  • No iterative computation
• Characteristics
  • Push model + fault tolerance
  • Techniques to boost HDD throughput: prefetching for mappers, preloading for reducers
30. MR2 pipeline
• 7 types of components with memory buffers
  1. Mappers & reducers: apply user-defined functions
  2. Prefetcher & preloader: eliminate concurrent disk access
  3. Sender & receiver & merger: implement MR2's push model
• Various buffers: pass data between components without disk I/O
• Minimum disk access (2 disk reads & 2 disk writes)
  • +1 disk write for fault tolerance
• [Pipeline diagram annotating the disk reads (R1, R2) and writes (W1, W2, plus W3 for fault tolerance)]
31. Prefetcher & mappers
• Prefetcher loads data for multiple mappers
• Mappers do not read input from disks
• [Diagram: disk throughput and CPU utilization over time with 2 mappers per node, Hadoop MapReduce vs. MR2; in Hadoop MapReduce each mapper reads its own block (Blk1, Blk2), while in MR2 the prefetcher loads Blk1-Blk4 on behalf of the mappers]
32. Push model in MR2
• Node-to-node network connections for pushing data
  • To reduce the number of network connections
• Data transfer from memory buffers (similar to Flink's pipelined execution)
  • Mappers store spills in a send buffer
  • Spills are pushed to the reducer side by the sender
  • MR2 does local sorting before pushing data (similar to Spark)
• Fault tolerance (can be turned on/off)
  • The input ranges of each spill are known to the master, so a spill can be reproduced
  • Spills are also stored on disk for fast recovery (extra disk write)
33. Receiver & merger & preloader & reducer
• Merger produces a file from different partitions' data held in the receiver's managed memory
  • Sorts each partition's data and then interleaves the partitions
• Preloader preloads each group into the reduce buffer
  • Reducers do not read data directly from disks
  • One disk access loads a group spanning 4 partitions (P1-P4)
• MR2 can eliminate concurrent disk reads from reducers thanks to the preloader
34. Result of sorting 80 GB/node (3.2 TB) with MR2

  |                                | MapReduce in Hadoop-2.7.1 | Tez-0.7.0 | Spark-1.5.1 | Flink-0.9.1 | MR2 |
  | Time (sec)                     | 2157                      | 1887      | 2171        | 1480        | 890 |
  | MR2 speedup over other engines | 2.42                      | 2.12      | 2.44        | 1.66        | -   |
35. Disk & network throughput (Flink vs. MR2)
1. DataSource / mapping: the prefetcher is effective; MR2 shows higher disk throughput
2. Partition / shuffling: records to shuffle are generated faster in MR2
3. DataSink / reducing: the preloader is effective; almost 2x throughput
• [Cluster disk (read/write) and network (in/out) throughput graphs for Flink and MR2, annotated with phases 1-3]
36. PUMA (PUrdue MApreduce benchmarks suite)
• Experimental results using 10 nodes
• [Results chart]
37. Outline
• TeraSort for various engines
• Experimental setup
• Experimental results & analysis
• What else for better performance?
• Conclusion
38. Conclusion
• Pipelined execution for both batch and streaming processing
  • Even better than other batch processing engines for TeraSort & HashJoin
• Shortcomings due to pipelined execution
  • No fine-grained fault tolerance
  • No map output compression
  • Low disk throughput during the post-shuffling phase
39. Thank you! Any questions?
