Apache Flink vs Apache Spark - Reproducible experiments on cloud.

http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/

http://blog.ashansa.org/2016/02/stream-processing-is-becoming-crucial.html


Batch Processing.
https://github.com/karamel-lab/batch-processing-comparison

Stream Processing.
https://github.com/karamel-lab/stream-processing-comparison

Published in: Technology
  1. Reproducible distributed experiments on cloud: Apache Flink vs Apache Spark. Shelan Perera, Ashansa Perera, Kamal Hakimzadeh
  2. “Reproducing experiments with minimal effort”
  3. Spark and Flink ▷ Batch Processing vs. Stream Processing ▷ Micro Batching vs. Natural Data Flow ▷ Good fit for scalable deployment in the cloud
  4. Motivation ▷ Validate performance claims ▷ Remove deployment overhead ▷ Design reproducible experiments
  5. Karamel => “Framework for reproducible distributed experiments”
  6. Benchmark - Batch ▷ Teragen: generates the data (Hadoop) ▷ Terasort: benchmarking algorithm (Spark, Flink)
  7. To make Dongwon Kim’s comparison reproducible. http://www.slideshare.net/ssuser6bb12d/a-comparative-performance-evaluation-of-apache-flink
  8. Our Deployment ▷ 1 Namenode ⇒ Master (low processing) ▷ 2 Worker nodes ⇒ Slaves (high processing)
  9. Deployment (diagram) ▷ Karamel, driven by a Karamel config, provisions the cluster ▷ EC2 Master: Hadoop Name Node, Spark Master, Flink Job Manager ▷ EC2 Slave (x 2): Hadoop Data Node, Spark Worker, Flink Task Manager
  10. Configuration ▷ Master / Namenode (m3.xlarge): CPU 2.6 GHz, 4 vCPUs, 16 GB memory, 80 GB SSD storage ▷ Slave / Worker (i2.4xlarge): CPU 2.5 GHz, 16 vCPUs, 122 GB memory, 1600 GB SSD storage
  11. Experiment ▷ Hadoop MR: Teragen writes the input data to HDFS ▷ Spark/Flink: Terasort on 200/400/600 GB
  12. Results: Batch Processing
  13. Application Performance
  14. Flink 1.5x faster than Spark
  15. Mainly because... ▷ Spark: does not overlap stages ▷ Flink: pipelines stages
  16. Collectl-Monitor ● Tool used to collect and draw results. ● https://github.com/shelan/collectl-monitoring
  17. System Performance - CPU (%)
  18. System Performance - Memory (GB)
  19. System Performance - Disk (MB/s)
  20. System Performance - Network (KB/s)
  21. Load Balancing - Workers (CPU %)
  22. Load Balancing - Workers (CPU %)
  23. Outcome ▷ Performance comparison results ▷ Karamel experiments to reproduce the same results with minimal effort
  24. How not to reproduce “our problems”
  25. EC2 advertises 800 GB disks, but the df command shows only 30 GB. If you are using I2 or R3 instances, you have to partition the disks and create a file system manually.
  26. Large Spark or Flink batch applications can fail for lack of disk space. Point the Flink temp directory and the Spark local directory at a partition with at least enough space to store the total input.
  27. Reproducing experiments on EC2 can cost you a lot. Karamel also supports spot instances, which can reduce the cost by 10x.
  28. IncompatibleClassChangeError when running StreamBench built for MR2 on hadoop2.x. There were no explicitly defined dependencies on previous versions, but one of the dependencies (mahout) had internal references to a hadoop1.x jar.
  29. Summary ▷ Introducing reproducible experiments on cloud ▷ Performance comparison of Spark and Flink ▷ Reproducible experiments are available online (https://github.com/karamel-lab)
  30. Thanks!
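
The disk-space advice on slides 25 and 26 can be checked before a run. The following is a minimal sketch, not part of the original experiments: it assumes you already know which directories Flink and Spark will use for temporary data, and the function names and example paths are hypothetical. It only reports whether each directory's file system has at least as much free space as the total input.

```python
import shutil
import tempfile


def has_enough_space(directory: str, required_bytes: int) -> bool:
    """Return True if the file system holding `directory` has at least
    `required_bytes` of free space."""
    return shutil.disk_usage(directory).free >= required_bytes


def check_scratch_dirs(directories, input_gb: float) -> dict:
    """Check each candidate scratch directory against the total input size.

    `directories` are hypothetical stand-ins for the Flink temp directory
    and the Spark local directory; `input_gb` is the Terasort input size.
    """
    required_bytes = int(input_gb * 1024 ** 3)
    return {d: has_enough_space(d, required_bytes) for d in directories}


if __name__ == "__main__":
    # Assumption: the system temp directory stands in for the real scratch
    # directories configured on the worker nodes.
    candidates = [tempfile.gettempdir()]
    for path, ok in check_scratch_dirs(candidates, input_gb=200).items():
        status = "ok" if ok else "too small for the input"
        print(f"{path}: {status}")
```

Running this on each worker before starting a 200, 400 or 600 GB sort would flag the 30 GB root partition case described on slide 25.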
