
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

In this hands-on Apache Flink presentation, you will learn, in step-by-step tutorial style, about:

• How to set up and configure your Apache Flink environment: Local/VM image (on a single machine), cluster (standalone), YARN, cloud (Google Compute Engine, Amazon EMR, ...)?
• How to get familiar with Flink tools (Command-Line Interface, Web Client, JobManager Web Interface, Interactive Scala Shell, Zeppelin notebook)?
• How to run some Apache Flink example programs?
• How to get familiar with Flink's APIs and libraries?
• How to write your Apache Flink code in the IDE (IntelliJ IDEA or Eclipse)?
• How to test and debug your Apache Flink code?
• How to deploy your Apache Flink code locally, in a cluster or in the cloud?
• How to tune your Apache Flink application (CPU, Memory, I/O)?



  1. Apache Flink Crash Course. Slim Baltagi & Srini Palthepu, with some materials from data-artisans.com. Chicago Apache Flink Meetup, August 4, 2015.
  2. "One week of trials and errors can save you up to half an hour of reading the documentation." (Anonymous)
  3. For an overview of Apache Flink, see our slides at http://goo.gl/gVOSp8 [Stack diagram: APIs & libraries (Gelly, Table, ML, SAMOA; DataSet API for batch processing in Java/Scala/Python; DataStream API for stream processing in Java/Scala; Hadoop M/R, Google Dataflow (WiP), MRQL, Cascading (WiP) compatibility); runtime (distributed streaming dataflow, batch optimizer, stream builder); deploy modes (local single JVM, embedded, Docker, standalone cluster, YARN, Tez, Mesos (WIP), cloud on Google's GCE, Amazon's EC2, IBM Docker Cloud, ...); storage (local files, HDFS, S3, Tachyon; databases such as MongoDB, HBase, SQL, ...; streams such as Flume, Kafka, RabbitMQ, ...); Zeppelin on top.]
  4. In this talk, we will cover practical steps for: setup and configuration of your Apache Flink environment; using Flink tools; learning Flink's APIs and domain-specific libraries through some Apache Flink program examples and the free training from data Artisans in Java and Scala; and writing, testing, debugging, deploying and tuning your Flink applications.
  5. Agenda: 1. How to set up and configure your Apache Flink environment? 2. How to use Apache Flink tools? 3. How to learn Apache Flink's APIs and its domain-specific libraries? 4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink? 5. How to write, test and debug your Apache Flink program in an IDE? 6. How to deploy your Apache Flink application locally, in a cluster or in the cloud? 7. How to tune your Apache Flink application?
  6. 1. How to set up and configure your Apache Flink environment? 1.1 Local (on a single machine) 1.2 VM image (on a single machine) 1.3 Docker 1.4 Standalone cluster 1.5 YARN cluster 1.6 Cloud
  7. 1.1 Local (on a single machine). Flink runs on Linux, OS X and Windows. In order to execute a program on a running Flink instance (and not from within your IDE), you need to install Flink on your machine. The following steps will be detailed for both Unix-like (Linux, OS X) and Windows environments: 1.1.1 Verify requirements 1.1.2 Download 1.1.3 Unpack 1.1.4 Check the unpacked archive 1.1.5 Start a local Flink instance 1.1.6 Validate that Flink is running 1.1.7 Run a Flink example 1.1.8 Stop the local Flink instance
  8. 1.1 Local (on a single machine). 1.1.1 Verify requirements: the machine that Flink will run on must have Java 1.6.x or higher installed. In a Unix-like environment, the $JAVA_HOME environment variable must also be set. Check the correct installation of Java by issuing java -version, and check that $JAVA_HOME is set by issuing echo $JAVA_HOME. If needed, follow the instructions for installing Java and setting JAVA_HOME here: http://docs.oracle.com/cd/E19182-01/820-7851/inst_cli_jd
  9. 1.1 Local (on a single machine). In a Windows environment, check the correct installation of Java by issuing java -version. Also, the bin folder of your Java Runtime Environment must be included in Windows' %PATH% variable. If needed, follow this guide to add Java to the path variable: http://www.java.com/en/download/help/path.xml 1.1.2 Download the latest stable release of Apache Flink from http://flink.apache.org/downloads.html For example, in a Unix-like environment, run: wget https://www.apache.org/dist/flink/flink-0.9.0/flink-0.9.0-bin-hadoop2.tgz
  10. 1.1 Local (on a single machine). 1.1.3 Unpack the downloaded .tgz archive. Example: $ cd ~/Downloads # Go to download directory $ tar -xvzf flink-*.tgz # Unpack the downloaded archive 1.1.4 Check the unpacked archive: $ cd flink-0.9.0 The resulting folder contains a Flink setup that can be executed locally without any further configuration. flink-conf.yaml under flink-0.9.0/conf contains the default configuration parameters that allow Flink to run out of the box in single-node setups.
  11. 11. 11 1.1 Local (on a single machine)
  12. 1.1 Local (on a single machine). 1.1.5 Start a local Flink instance. Given that you have a local Flink installation, you can start a Flink instance that runs a master and a worker process on your local machine in a single JVM. This execution mode is useful for local testing. On a Unix-like system you can start a Flink instance as follows: cd /to/your/flink/installation ./bin/start-local.sh
  13. 1.1 Local (on a single machine). 1.1.5 Start a local Flink instance. On Windows you can start it either with the Windows batch files, by running: cd C:\to\your\flink\installation .\bin\start-local.bat or with Cygwin and the Unix scripts: start the Cygwin terminal, navigate to your Flink directory and run the start-local.sh script: $ cd /cygdrive/c $ cd flink $ bin/start-local.sh
  14. 1.1 Local (on a single machine). The JobManager (the master of the distributed system) automatically starts a web interface to observe program execution. It runs on port 8081 by default (configured in conf/flink-conf.yaml): http://localhost:8081/ 1.1.6 Validate that Flink is running. You can validate that a local Flink instance is running by: issuing $ jps (jps is the Java Virtual Machine process status tool); looking at the log files in ./log/ with $ tail log/flink-*-jobmanager-*.log; or opening the JobManager's web interface at http://localhost:8081
  15. 1.1 Local (on a single machine). 1.1.7 Run a Flink example. On a Unix-like system you can run a Flink example as follows: cd /to/your/flink/installation ./bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar With the Windows batch files, open a second terminal and run: cd C:\to\your\flink\installation .\bin\flink.bat run .\examples\flink-java-examples-0.9.0-WordCount.jar 1.1.8 Stop the local Flink instance: on Unix you call ./bin/stop-local.sh; on Windows you quit the running process with Ctrl+C.
  16. 1.2 VM image (on a single machine). Download the Flink virtual machine from: https://docs.google.com/uc?id=0B-oU5Z27sz1hZ0VtaW5idFViNU0&export=download The password is: flink This version works with VMware Fusion on OS X, since there is no VMware Player for OS X: https://www.vmware.com/products/fusion/fusion-evaluation.html
  17. 1.3 Docker. Apache Flink cluster deployment on Docker using Docker-Compose, by Romeo Kienzler. Talk at the Apache Flink Meetup Berlin, planned for August 26, 2015: http://www.meetup.com/Apache-Flink-Meetup/events/2239133/ The talk will: introduce the basic concepts of container isolation, exemplified on Docker; explain how Apache Flink is made elastic using Docker-Compose; and show how to push the cluster to the cloud, exemplified on the IBM Docker Cloud.
  18. 1.4 Standalone cluster. See the quick start cluster setup: https://ci.apache.org/projects/flink/flink-docs-release-0.9/quickstart/setup_quickstart.html#setup See instructions on how to run Flink in a fully distributed fashion on a cluster. This involves two steps: installing and configuring Flink, and installing and configuring the Hadoop Distributed File System (HDFS): https://ci.apache.org/projects/flink/flink-docs-master/setup/cluster_setup.html
  19. 1.5 YARN cluster. You can easily deploy Flink on your existing YARN cluster. Download the Flink Hadoop 2 package (Flink with Hadoop 2): http://www.apache.org/dyn/closer.cgi/flink/ Make sure your HADOOP_HOME (or YARN_CONF_DIR or HADOOP_CONF_DIR) environment variable is set so that Flink can read your YARN and HDFS configuration.
  20. 1.5 YARN cluster. Run the YARN client with: ./bin/yarn-session.sh You can run the client with the options -n 10 -tm 8192 to allocate 10 TaskManagers with 8 GB of memory each. For more detailed instructions, check out the documentation: https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html
  21. 1.6 Cloud. 1.6.1 Google Compute Engine (GCE) 1.6.2 Amazon EMR
  22. 1.6 Cloud. 1.6.1 Google Compute Engine. Free trial for Google Compute Engine: https://cloud.google.com/free-trial/ Enjoy your $300 in GCE for 60 days! Now, how to set up Flink with Hadoop 1 or Hadoop 2 on top of a Google Compute Engine cluster? Google's bdutil starts a cluster and deploys Flink with Hadoop. To get started, just follow the steps here: https://ci.apache.org/projects/flink/flink-docs-master/setup/gce_setup.html
  23. 1.6 Cloud. 1.6.2 Amazon EMR. Amazon Elastic MapReduce (Amazon EMR) is a web service providing a managed Hadoop framework. http://aws.amazon.com/elasticmapreduce/ http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html Example: Use Stratosphere with Amazon Elastic MapReduce, February 18, 2014, by Robert Metzger: https://flink.apache.org/news/2014/02/18/amazon-elastic-mapreduce-cloud-yarn.html
  24. 1.3 Docker (continued). Docker can be used for local development. Resource requirements on data processing clusters often exhibit high variation, and elastic deployments reduce TCO (total cost of ownership). Container-based virtualization is lightweight and portable (build once, run anywhere), eases the packaging of applications, and is automated, scripted and isolated. Apache Flink cluster deployment on Docker using Docker-Compose: https://github.com/streamnsight/docker-flink
  25. 2. How to use Apache Flink tools? 2.1 Command-Line Interface (CLI) 2.2 Job Client Web Interface 2.3 JobManager Web Interface 2.4 Interactive Scala Shell 2.5 Zeppelin Notebook
  26. 2.1 Command-Line Interface (CLI). Example: ./bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar bin/flink has 4 major actions: run (runs a program), info (displays information about a program), list (lists running and finished programs, e.g. ./bin/flink list -r -s), and cancel (cancels a running program, identified by its job ID with -i). See more examples: https://ci.apache.org/projects/flink/flink-docs-master/apis/cli.html
  27. 2.2 Job Client Web Interface. Flink provides a web interface to: upload jobs, inspect their execution plans, execute them, showcase programs, debug execution plans, and demonstrate the system as a whole. The web interface runs on port 8080 by default. To specify a custom port, set the webclient.port property in the ./conf/flink-conf.yaml configuration file.
  28. 2.2 Job Client Web Interface. Start the web interface by executing: ./bin/start-webclient.sh Stop it by executing: ./bin/stop-webclient.sh Jobs are submitted to the JobManager specified by jobmanager.rpc.address and jobmanager.rpc.port. For more details and further configuration options, please consult: https://ci.apache.org/projects/flink/flink-docs-release-0.9/setup/config.html#webclient
  29. 2.3 JobManager Web Interface. The JobManager (the master of the distributed system) starts a web interface to observe program execution. It runs on port 8081 by default (configured in conf/flink-conf.yaml). Open the JobManager's web interface at http://localhost:8081 Related settings: jobmanager.rpc.port (6123) and jobmanager.web.port (8081).
  30. 2.3 JobManager Web Interface [screenshots]: overall system status, job execution details, TaskManager resource utilization.
  31. 2.3 JobManager Web Interface. The JobManager web frontend lets you: track the progress of a Flink program, as all status changes are also logged to the JobManager's log file; and figure out why a program failed, as it displays the exceptions of failed tasks and lets you work out which parallel task failed first and caused the other tasks to cancel the execution.
  32. 2.4 Interactive Scala Shell. Flink comes with an interactive Scala shell, a REPL (Read Evaluate Print Loop): ./bin/start-scala-shell.sh It supports interactive queries and lets you explore data quickly, with the complete Scala API available. It can be used in a local setup as well as in a cluster setup, and comes with command history and auto-completion. So far only batch mode is supported; there are plans to add streaming in the future: https://ci.apache.org/projects/flink/flink-docs-master/scala_shell.html
  33. 2.4 Interactive Scala Shell. bin/start-scala-shell.sh --host localhost --port 6123
  34. 2.4 Interactive Scala Shell. Example 1: Scala-Flink> val input = env.fromElements(1,2,3,4) Scala-Flink> val doubleInput = input.map(_ * 2) Scala-Flink> doubleInput.print() Example 2: Scala-Flink> val text = env.fromElements( "To be, or not to be,--that is the question:--", "Whether 'tis nobler in the mind to suffer", "The slings and arrows of outrageous fortune", "Or to take arms against a sea of troubles,") Scala-Flink> val counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1) Scala-Flink> counts.print()
  35. 2.4 Interactive Scala Shell. Problems with the interactive Scala shell: no visualization, no saving, no replaying of written code, and no assistance as in an IDE.
  36. 2.5 Zeppelin Notebook. A web-based interactive computation environment that combines rich text, executable code, plots and rich media. Suited to exploratory data science and storytelling.
  37. 2.5 Zeppelin Notebook: http://localhost:8080/
  38. 3. How to learn Flink's APIs and libraries? 3.1 How to run the examples in the Apache Flink bundle? 3.2 How to learn Flink programming APIs? 3.3 How to learn Apache Flink libraries?
  39. 3.1 How to run the examples in the Apache Flink bundle? 3.1.1 Where are the examples? 3.1.2 Where is the related source code? 3.1.3 How to re-build these examples? 3.1.4 How to run these examples?
  40. 3.1 How to run the examples in the Apache Flink bundle? 3.1.1 Where are the examples? [screenshot]
  41. 3.1 How to run the examples in the Apache Flink bundle? The examples provided in the Flink bundle showcase different applications of Flink, from simple word counting to graph algorithms. They illustrate the use of Flink's API and are a very good way to learn how to write Flink jobs. A good starting point would be to modify them! Now, where is the related source code?
  42. 3.1 How to run the examples in the Apache Flink bundle? 3.1.2 Where is the related source code? You can find the source code of these Flink examples in the flink-java-examples or flink-scala-examples sub-projects of the flink-examples module of the Flink source release. You can also access the source (and hence the examples) through GitHub: https://github.com/apache/flink/tree/master/flink-examples
  43. 3.1 How to run the examples in the Apache Flink bundle? 3.1.2 Where is the related source code? If you don't want to import the whole Flink project just for playing around with the examples, you can: create an empty Maven project (this script will automatically set everything up for you: $ curl http://flink.apache.org/q/quickstart.sh | bash); import the "quickstart" project into Eclipse or IntelliJ, which will download all dependencies and package everything correctly; and, if you want to use an example there, just copy the Java file into the "quickstart" project.
  44. 3.1 How to run the examples in the Apache Flink bundle? 3.1.3 How to re-build these examples? To re-build the examples, run "mvn clean package -DskipTests" in the "flink-examples/flink-java-examples" directory.
  45. 3.1 How to run the examples in the Apache Flink bundle? 3.1.4 How to run these examples? To display the command-line arguments: ./bin/flink info ./examples/flink-java-examples-0.9.0-WordCount.jar Example of running an example: ./bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar More on the bundled examples: https://ci.apache.org/projects/flink/flink-docs-master/apis/examples.html#running-an-example
  46. 3.2 How to learn Flink programming APIs? 3.2.1 DataSet API 3.2.2 DataStream API 3.2.3 Table API - relational queries
  47. 3.2 How to learn Flink programming APIs? 3.2.1 DataSet API: https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html https://ci.apache.org/projects/flink/flink-docs-master/api/java/ FREE Apache Flink training by Data Artisans: DataSet API basics. Lecture: slides http://dataartisans.github.io/flink-training/dataSetBasics/slides.html and video https://www.youtube.com/watch?v=1yWKZ26NQeU Exercise: http://dataartisans.github.io/flink-training/dataSetBasics/handsOn.html
  48. 3.2 How to learn Flink programming APIs? 3.2.1 DataSet API: DataSet API advanced. Lecture: slides http://dataartisans.github.io/flink-training/dataSetAdvanced/slides.html and video https://www.youtube.com/watch?v=1yWKZ26NQeU Exercise: http://dataartisans.github.io/flink-training/dataSetAdvanced/handsOn.html A minimal DataSet WordCount sketch follows below.
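To make the API concrete before diving into the training material, here is a minimal, self-contained WordCount sketch against the 0.9-era Java DataSet API, in the spirit of the bundled examples (the input string is made up; the eager-print note reflects 0.9 behavior):

      import org.apache.flink.api.common.functions.FlatMapFunction;
      import org.apache.flink.api.java.DataSet;
      import org.apache.flink.api.java.ExecutionEnvironment;
      import org.apache.flink.api.java.tuple.Tuple2;
      import org.apache.flink.util.Collector;

      public class WordCountSketch {
        public static void main(String[] args) throws Exception {
          final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
          DataSet<String> text = env.fromElements("to be or not to be");
          DataSet<Tuple2<String, Integer>> counts = text
              .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                  // Emit (word, 1) for every token in the line.
                  for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                      out.collect(new Tuple2<String, Integer>(word, 1));
                    }
                  }
                }
              })
              .groupBy(0)  // group by the word (field 0 of the tuple)
              .sum(1);     // sum the counts (field 1)
          counts.print(); // in Flink 0.9, print() eagerly triggers execution and prints locally
        }
      }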
  49. 3.2 How to learn Flink programming APIs? 3.2.2 DataStream API: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html https://ci.apache.org/projects/flink/flink-docs-master/api/java/ Example 1: Event pattern detection with Apache Flink. This is a Flink streaming demo given by Data Artisans on July 17, 2015, titled 'Apache Flink: Unifying batch and streaming modern data analysis', at the Bay Area Apache Flink Meetup. Related code: https://github.com/StephanEwen/flink-demos/tree/master/streaming-state-machine Related slides: http://www.slideshare.net/KostasTzoumas/first-flink-bay-area-meetup Related video recording: https://www.youtube.com/watch?v=BJjGD8ijJcg
  50. 3.2 How to learn Flink programming APIs? 3.2.2 DataStream API: Example 2: Fault-tolerant streaming with Flink. Slides 16-23: http://www.slideshare.net/AljoschaKrettek/flink-010-upcoming-features Code: https://github.com/aljoscha/flink-fault-tolerant-stream-example This is a demo showing how Flink deals with stateful streaming jobs and fault tolerance. Example 3: Flink-Storm compatibility examples: https://github.com/apache/flink/tree/master/flink-contrib/flink-storm-compatibility/flink-storm-compatibility-examples
  51. 3.2 How to learn Flink programming APIs? 3.2.2 DataStream API: Example 4: Data stream analytics with Flink: http://net.t-labs.tu-berlin.de/~nsemmler/blog//flink/2015/03/02/Data-Stream-Analysis-with-flink.html Example 5: Introducing Flink Streaming: http://flink.apache.org/news/2015/02/09/streaming-example.html Examples from the code base (flink-streaming-examples): https://github.com/apache/flink/tree/master/flink-staging/flink-streaming/flink-streaming-examples A minimal streaming WordCount sketch follows below.
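As a complement to the linked demos, here is a minimal streaming WordCount sketch against the 0.9-era Java DataStream API (the host and port are placeholders; run nc -lk 9999 in another terminal first; note that later releases renamed groupBy on streams to keyBy):

      import org.apache.flink.api.common.functions.FlatMapFunction;
      import org.apache.flink.api.java.tuple.Tuple2;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.util.Collector;

      public class StreamingWordCountSketch {
        public static void main(String[] args) throws Exception {
          final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          env.socketTextStream("localhost", 9999)  // placeholder source: lines from a socket
             .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                 for (String word : line.toLowerCase().split("\\W+")) {
                   if (!word.isEmpty()) {
                     out.collect(new Tuple2<String, Integer>(word, 1));
                   }
                 }
               }
             })
             .groupBy(0)  // key the stream by the word (keyBy in later releases)
             .sum(1)      // emit a running count per word
             .print();
          env.execute("Streaming WordCount sketch");
        }
      }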
  52. 3.2 How to learn Flink programming APIs? 3.2.3 Table API - relational queries: https://ci.apache.org/projects/flink/flink-docs-master/libs/table.html To use the Table API in a project: first set up a Flink program (https://ci.apache.org/projects/flink/flink-docs-master/apis), then add this to the dependencies section of your pom.xml: <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-table</artifactId> <version>0.10-SNAPSHOT</version> </dependency> The Table API is not currently part of the binary distribution; you need to link it for cluster execution: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules
  53. 3.2 How to learn Flink programming APIs? 3.2.3 Table API - relational queries: FREE Apache Flink training by Data Artisans - Table API. Lecture: http://www.slideshare.net/dataArtisans/flink-table Exercise: http://dataartisans.github.io/flink-training/tableApi/handsOn.html See also the example in slides 36-43 on log analysis: http://www.grid.ucy.ac.cy/file/Talks/talks/DeepAnalysiswithApacheFlink_2nd_cloud_workshop.pdf A small Table API sketch follows below.
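To give a flavor of the expression-based Table API of that era, here is a small sketch. Hedge: the Table API was young, and its method names and conversion types shifted between 0.9 and 0.10, so treat the field expressions and the toDataSet() call as illustrative rather than a reference:

      import org.apache.flink.api.java.DataSet;
      import org.apache.flink.api.java.ExecutionEnvironment;
      import org.apache.flink.api.java.table.TableEnvironment;
      import org.apache.flink.api.java.tuple.Tuple2;
      import org.apache.flink.api.table.Table;

      public class TableApiSketch {
        public static void main(String[] args) throws Exception {
          ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
          TableEnvironment tableEnv = new TableEnvironment();
          DataSet<Tuple2<String, Integer>> words = env.fromElements(
              new Tuple2<String, Integer>("flink", 1),
              new Tuple2<String, Integer>("table", 1),
              new Tuple2<String, Integer>("flink", 1));
          // Name the tuple fields and run a relational group-by/aggregate.
          Table table = tableEnv.fromDataSet(words, "word, frequency");
          Table result = table.groupBy("word").select("word, frequency.sum as total");
          // Convert back to a DataSet to print (the target type here is illustrative).
          tableEnv.toDataSet(result, Tuple2.class).print();
        }
      }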
  54. 3.3 Apache Flink domain-specific libraries: 3.3.1 FlinkML - machine learning for Flink 3.3.2 Gelly - graph analytics for Flink
  55. 3.3 Apache Flink libraries. 3.3.1 FlinkML - machine learning for Flink: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ FlinkML quickstart guide: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/quickstart.html To use FlinkML in a project: first set up a Flink program (https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink), then add this to the dependencies section of your pom.xml: <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-ml</artifactId> <version>0.10-SNAPSHOT</version> </dependency>
  56. 3.3 Apache Flink libraries. 3.3.1 FlinkML - machine learning for Flink. Quick start: run the K-Means example: https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html Computing recommendations at extreme scale with Apache Flink: http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/ and related code: https://github.com/tillrohrmann/flink-perf/blob/ALSJoinBlockingUnified/flink-jobs/src/main/scala/com/.../ALSJoinBlocking.scala Naive Bayes on Apache Flink: http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html FlinkML is not currently part of the binary distribution; you need to link it for cluster execution: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules
  57. 3.3 Apache Flink libraries. 3.3.2 Gelly: Flink graph API: https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html To use Gelly in a project: first set up a Flink program (https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink), then add this to the dependencies section of your pom.xml: <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-gelly</artifactId> <version>0.10-SNAPSHOT</version> </dependency>
  58. 3.3 Apache Flink libraries. Gelly examples: https://github.com/apache/flink/tree/master/flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/example Gelly exercise & solution: Gelly API - PageRank on the reply graph: http://dataartisans.github.io/flink-training/exercises/replyGraphGelly.html Gelly is not currently part of the binary distribution; you need to link it for cluster execution: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules A tiny Gelly sketch follows below.
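For orientation, a tiny Gelly sketch in Java. Hedge: this is written against the 0.9/0.10-era org.apache.flink.graph API; the toy edge data and the inDegrees() call are just illustrative:

      import org.apache.flink.api.java.DataSet;
      import org.apache.flink.api.java.ExecutionEnvironment;
      import org.apache.flink.graph.Edge;
      import org.apache.flink.graph.Graph;
      import org.apache.flink.types.NullValue;

      public class GellySketch {
        public static void main(String[] args) throws Exception {
          ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
          // A toy edge list: 1 -> 2, 2 -> 3, 3 -> 1, each with a unit weight.
          DataSet<Edge<Long, Double>> edges = env.fromElements(
              new Edge<Long, Double>(1L, 2L, 1.0),
              new Edge<Long, Double>(2L, 3L, 1.0),
              new Edge<Long, Double>(3L, 1L, 1.0));
          // Build a graph whose vertices carry no value (NullValue).
          Graph<Long, NullValue, Double> graph = Graph.fromDataSet(edges, env);
          graph.inDegrees().print(); // a DataSet of (vertexId, in-degree) pairs
        }
      }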
  59. 4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink? 4.1 How to set up your IDE (IntelliJ IDEA)? 4.2 How to set up your IDE (Eclipse)? Flink uses mixed Scala/Java projects, which pose a challenge to some IDEs. Minimal requirements for an IDE are: support for Java and Scala (also mixed projects), and support for Maven with Java and Scala.
  60. 4.1 How to set up your IDE (IntelliJ IDEA)? IntelliJ IDEA supports Maven out of the box and offers a plugin for Scala development. IntelliJ IDEA download: https://www.jetbrains.com/idea/download/ IntelliJ Scala plugin: http://plugins.jetbrains.com/plugin/?id=1347 Check out the Setting up IntelliJ IDEA guide for details: https://github.com/apache/flink/blob/master/docs/internals/ide_setup.md#intellij-idea Screencast: Run Apache Flink WordCount from IntelliJ: https://www.youtube.com/watch?v=JIV_rX-OIQM
  61. 4.2 How to set up your IDE (Eclipse)? For Eclipse users, Apache Flink committers recommend Scala IDE 3.0.3, based on Eclipse Kepler. While this is a slightly older version, they found it to be the version that works most robustly for a complex project like Flink. One restriction, though, is that it works only with Java 7, not with Java 8. Check out the Eclipse setup docs: https://github.com/apache/flink/blob/master/docs/internals/ide_setup.md#eclipse
  62. 5. How to write, test and debug your Apache Flink program in an IDE? 5.1 How to write a Flink program? 5.1.1 How to generate a Flink project with Maven? 5.1.2 How to import the Flink Maven project into an IDE? 5.1.3 How to use logging? 5.1.4 FAQs and best practices related to coding 5.2 How to test your Flink program? 5.3 How to debug your Flink program?
  63. 5.1 How to write a Flink program in an IDE? The easiest way to get a working setup to develop (and locally execute) Flink programs is to follow the quick start guides: https://ci.apache.org/projects/flink/flink-docs-master/quickstart/java_api_quickstart.html https://ci.apache.org/projects/flink/flink-docs-master/quickstart/scala_api_quickstart.html They use a Maven archetype to configure and generate a Flink Maven project, which saves you time dealing with transitive dependencies. This Maven project can then be imported into your IDE.
  64. 5.1 How to write a Flink program in an IDE? 5.1.1 How to generate a skeleton Flink project with Maven? Generate a skeleton project with Maven to get started: mvn archetype:generate \ -DarchetypeGroupId=org.apache.flink \ -DarchetypeArtifactId=flink-quickstart-java \ -DarchetypeVersion=0.9.0 (You can also use the Scala quickstart artifact instead of flink-quickstart-java, or "0.10-SNAPSHOT" as the version.) No need to manually download any .tgz or .jar files for now.
  65. 5.1 How to write a Flink program in an IDE? 5.1.1 How to generate a skeleton Flink project with Maven? The generated projects are located in a folder called flink-java-project or flink-scala-project. In order to test the generated project and download all required dependencies, run the following commands (change flink-java-project to flink-scala-project for Scala projects): cd flink-java-project mvn clean package Maven will now download all required dependencies and build the Flink quickstart project.
  66. 5.1 How to write a Flink program in an IDE? 5.1.2 How to import the Flink Maven project into an IDE. The generated Maven project needs to be imported into your IDE. IntelliJ: select "File" -> "Import Project", select the root folder of your project, select "Import project from external model" -> "Maven", leave the default options and finish the import. Eclipse: select "File" -> "Import" -> "Maven" -> "Existing Maven Project" and follow the import instructions.
  67. 5.1 How to write a Flink program in an IDE? 5.1.3 How to use logging? Logging in Flink is implemented using the slf4j logging interface, with log4j as the underlying logging framework. log4j is controlled via a properties file, usually called log4j.properties. You can pass the filename and location of this file to the JVM using the -Dlog4j.configuration= parameter. Loggers are created via slf4j as follows: import org.slf4j.Logger; import org.slf4j.LoggerFactory; private static final Logger LOG = LoggerFactory.getLogger(Foobar.class); You can also use logback instead of log4j: https://ci.apache.org/projects/flink/flink-docs-release-0.9/internals/logging.html
  68. 5.1 How to write a Flink program? 5.1.4 FAQs & best practices related to coding. Errors: http://flink.apache.org/faq.html#errors Usage: http://flink.apache.org/faq.html#usage Best practices: https://ci.apache.org/projects/flink/flink-docs-master/apis/best_practices.html
  69. 5.2 How to test your Flink program in an IDE? Start Flink in your IDE for local development and debugging: final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(); Use Flink's testing framework: @RunWith(Parameterized.class) class YourTest extends MultipleProgramsTestBase { @Test public void testRunWithConfiguration() { expectedResult = "1 11\n"; } } A slightly fuller test sketch follows below.
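For completeness, here is a small self-contained JUnit sketch using a local environment. Hedge: this is plain JUnit rather than the slide's MultipleProgramsTestBase harness, and the eagerly-executing count() reflects 0.9-era behavior:

      import static org.junit.Assert.assertEquals;

      import org.apache.flink.api.java.DataSet;
      import org.apache.flink.api.java.ExecutionEnvironment;
      import org.junit.Test;

      public class LocalEnvironmentTest {
        @Test
        public void countsElementsInLocalEnvironment() throws Exception {
          // Runs the whole job inside the current JVM - no cluster needed.
          ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
          DataSet<Integer> input = env.fromElements(1, 2, 3, 4);
          // count() eagerly executes the program and returns the result locally.
          assertEquals(4L, input.count());
        }
      }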
  70. 5.3 How to debug your Flink program in an IDE? Flink programs can be executed and debugged from within an IDE. This significantly eases the development process and gives a programming experience similar to working on a regular Java application. Starting a Flink program in your IDE is as easy as starting its main() method. Under the hood, the ExecutionEnvironment will start a local Flink instance within the execution process. Hence it is also possible to put breakpoints everywhere in your code and debug it.
  71. 5.3 How to debug your Flink program in an IDE? Assuming you have an IDE with a Flink quickstart project imported, you can execute and debug the example WordCount program included in the quickstart project as follows: open the org.apache.flink.quickstart.WordCount class in your IDE; place a breakpoint somewhere in the flatMap() method of the LineSplitter class, which is defined inline in the WordCount class; then execute or debug the main() method of the WordCount class using your IDE.
  72. 5.3 How to debug your Flink program in an IDE? When you start a program locally with the LocalExecutor, you can place breakpoints in your functions and debug them like normal Java/Scala programs. Accumulators are very helpful in tracking the behavior of the parallel execution: they allow you to gather information inside the program's operations and show it after the program execution.
  73. Debugging with the IDE [screenshot]
  74. Debugging on a cluster. Good old system-out debugging: get a logger and start logging: private static final Logger LOG = LoggerFactory.getLogger(YourJob.class); LOG.info("elementCount = {}", elementCount); You can also use System.out.println().
  75. Getting logs on a cluster. Non-YARN (= bare-metal installation): the logs are located in each TaskManager's log/ directory; ssh there and read the logs. YARN: make sure YARN log aggregation is enabled, then retrieve the logs from YARN (once the app is finished): $ yarn logs -applicationId <application ID>
  76. Flink logs. A JobManager log begins with build information, JVM details and init messages, for example (excerpt):
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 0.9-SNAPSHOT, Rev:2e515fc, Date:27.05.2015 @ 11:24:23 CEST)
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: robert
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.75-b04
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 736 MiBytes
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager -     -XX:MaxPermSize=256m
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager -     -Xms768m
      11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager -     -Xmx768m
      ...
      11:42:39,527 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system at localhost:6123.
      11:42:40,189 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
      11:42:40,581 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:51194 - max concurrent requests: 50 - max backlog: 1000
      11:42:40,613 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting embedded TaskManager for JobManager's LOCAL execution mode
      11:42:41,092 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
      11:42:42,523 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManger web frontend
  77. Get logs of a running YARN application [screenshot]
  78. Debugging on a cluster - accumulators. Useful to verify your assumptions about the data: class Tokenizer extends RichFlatMapFunction<String, String> { @Override public void flatMap(String value, Collector<String> out) { getRuntimeContext().getLongCounter("elementCount").add(1L); // do more stuff. } } Use the "Rich*" function variants to get access to the RuntimeContext.
  79. Debugging on a cluster - accumulators. Where can I get the accumulator results? They are returned by env.execute(), displayed when the program is executed with /bin/flink, and shown in the JobManager web frontend: JobExecutionResult result = env.execute("WordCount"); long ec = result.getAccumulatorResult("elementCount");
  80. Live monitoring with accumulators. In versions prior to Flink 0.10, accumulators are only available after the job finishes. In Flink 0.10, accumulators are updated while the job is running, and there are system accumulators (number of bytes/records processed, ...).
  81. In Flink 0.10, the JobManager web interface displays the accumulators live while the job runs.
  82. Excursion: RichFunctions. The default functions are SAMs (Single Abstract Method): interfaces with one method, suitable for Java 8 lambdas. There is a "Rich" variant for each function, e.g. RichFlatMapFunction, with the methods open(Configuration c), close() and getRuntimeContext().
  83. Excursion: RichFunctions & RuntimeContext. The RuntimeContext provides some useful methods: getIndexOfThisSubtask() / getNumberOfParallelSubtasks() (who am I, and if yes, how many?), getExecutionConfig(), accumulators, and the DistributedCache. A RichFunction sketch follows below.
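A minimal sketch of the Rich* lifecycle (the class name and its tagging logic are made up for illustration; the hooks and RuntimeContext calls are the ones named above):

      import org.apache.flink.api.common.functions.RichMapFunction;
      import org.apache.flink.configuration.Configuration;

      public class SubtaskTagger extends RichMapFunction<String, String> {
        private transient int subtask;

        @Override
        public void open(Configuration parameters) {
          // Called once per parallel instance, before the first map() call.
          subtask = getRuntimeContext().getIndexOfThisSubtask();
        }

        @Override
        public String map(String value) {
          // Tag every record with the parallel subtask that processed it.
          return "subtask-" + subtask + ": " + value;
        }

        @Override
        public void close() {
          // Called once after the last record; release connections or other resources here.
        }
      }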
  84. 84. 84 Attaching a remote debugger to Flink in a Cluster
  85. Attaching a debugger to Flink in a cluster. Add a JVM start option in flink-conf.yaml: env.java.opts: "-agentlib:jdwp=..." Open an SSH tunnel to the machine: ssh -f -N -L 5005:127.0.0.1:5005 user@host Then use your IDE to start a remote debugging session.
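(The slide elides the JDWP options. A typical value, standard JVM debugging flags rather than anything Flink-specific, would be -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005, matching the tunneled port 5005 above.)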
  86. 6. How to deploy your Apache Flink application locally, in a cluster or in the cloud? 6.1 Deploy locally 6.2 Deploy in a cluster 6.3 Deploy in the cloud
  87. 6. How to deploy your Apache Flink application locally, in a cluster or in the cloud? 6.1 Deploy locally. Package your job in a jar and submit it via: /bin/flink (the command-line interface), the RemoteExecutionEnvironment (from a local Java app), the web frontend (GUI), or the Scala shell. A RemoteExecutionEnvironment sketch follows below.
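A short sketch of submitting from a Java program via a remote environment (the host, port and jar path are placeholders; 6123 is the default jobmanager.rpc.port mentioned earlier):

      import org.apache.flink.api.java.ExecutionEnvironment;

      public class RemoteSubmitSketch {
        public static void main(String[] args) throws Exception {
          // The jar must contain the user code (e.g. custom functions) of this program.
          ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
              "jobmanager-host", 6123, "/path/to/your-job.jar");
          env.fromElements("a", "b", "a")
             .distinct()
             .print(); // triggers execution on the remote cluster (0.9-era eager print)
        }
      }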
  88. Flink web submission client [screenshots]: select jobs and preview the plan; understand the optimizer's choices.
  89. 6.2 Deploy in a cluster. You can start a cluster locally: $ tar xzf flink-*.tgz $ cd flink $ bin/start-cluster.sh Starting Job Manager Starting task manager on host $ jps 5158 JobManager 5262 TaskManager
  90. 6.3 Deploy in the cloud. Google Compute Engine (GCE). Free trial for Google Compute Engine: https://cloud.google.com/free-trial/ Enjoy your $300 in GCE for 60 days! http://ci.apache.org/projects/flink/flink-docs-master/setup/gce_setup.html ./bdutil -e extensions/flink/flink_env.sh deploy
  91. 6.3 Deploy in the cloud. Amazon EMR, or any other cloud provider with preinstalled Hadoop YARN: http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html wget http://stratosphere-bin.amazonaws.com/flink-0.9-SNAPSHOT-bin-hadoop2.tgz tar xvzf flink-0.9-SNAPSHOT-bin-hadoop2.tgz cd flink-0.9-SNAPSHOT/ ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096 Alternatively, install Flink yourself on the machines.
  92. 7. How to tune your Apache Flink application? 7.1 Tuning CPU 7.2 Tuning memory 7.3 Tuning I/O 7.4 Optimizer hints
  93. 7. How to tune your Apache Flink application (CPU, memory, I/O)? 7.1 Tuning CPU: processing slots, threads, ... https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#configuring-taskmanager-processing-slots
  94. Tell Flink how many CPUs you have. taskmanager.numberOfTaskSlots in flink-conf.yaml sets the number of parallel job instances, i.e. the number of pipelines per TaskManager. Recommended: the number of available CPU cores. [Diagram: several parallel Map -> Reduce pipelines per TaskManager.]
  95. Configuring TaskManager processing slots. 3 machines, each with 4 CPU cores, give us 3 TaskManagers with 4 slots each, for a total of 12 processing slots. Set flink-conf.yaml: taskmanager.numberOfTaskSlots: 4 or: /bin/yarn-session.sh -slots 4 -n 3 (Recommended value: the number of CPU cores.)
  96. Example 1: WordCount with parallelism = 1. When no argument is given, parallelism.default from flink-conf.yaml is used; the default value is 1. [Diagram: one of the 12 slots is occupied by a single Source -> flatMap -> Reduce -> Sink pipeline.]
  97. Example 2: WordCount with parallelism = 2. [Diagram: two slots occupied, each running a Source -> flatMap -> Reduce -> Sink pipeline.] Places to set the parallelism for a job: flink-conf.yaml (parallelism.default: 2), the Flink client (./bin/flink -p 2), or the ExecutionEnvironment (env.setParallelism(2)).
  98. Example 3: WordCount with parallelism = 12 (using all resources). [Diagram: all 12 slots across the 3 TaskManagers run a Source -> flatMap -> Reduce -> Sink pipeline.]
  99. Example 4: WordCount with parallelism = 12 and sink parallelism = 1. The parallelism of each operator can be set individually in the APIs: counts.writeAsCsv(outputPath, "\n", " ").setParallelism(1); [Diagram: 12 Source -> flatMap -> Reduce pipelines; the data is streamed to the single sink slot from all the other slots on the other TaskManagers.]
  100. 7. How to tune your Apache Flink application (CPU, memory, I/O)? 7.2 Tuning memory: how to adjust memory usage on the TaskManager?
  101. Memory in Flink - theory. Memory management (batch API): https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525
  102. Memory in Flink - configuration. Total JVM heap: taskmanager.heap.mb, or the "-tm" argument of bin/yarn-session.sh. Managed memory: relative via taskmanager.memory.fraction, or absolute via taskmanager.memory.size. Network buffers: taskmanager.network.numberOfBuffers.
  103. Memory in Flink - OOM. Example (excerpt):
      2015-02-20 11:22:54 INFO JobClient:345 - java.lang.OutOfMemoryError: Java heap space
          at org.apache.flink.runtime.io.network.serialization.DataOutputSerializer.resize(DataOutputSerializer.java:249)
          at org.apache.flink.runtime.io.network.serialization.DataOutputSerializer.write(DataOutputSerializer.java:93)
          at com.esotericsoftware.kryo.io.Output.flush(Output.java:163)
          ...
          at org.apache.flink.api.java.typeutils.runtime.KryoSerializer.serialize(KryoSerializer.java:155)
          at org.apache.flink.runtime.io.network.api.RecordWriter.emit(RecordWriter.java:82)
          at org.apache.flink.runtime.operators.GroupReduceDriver.run(GroupReduceDriver.java:124)
          at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:360)
          at java.lang.Thread.run(Thread.java:745)
      Memory is missing here, on the JVM heap: reduce the managed memory by lowering taskmanager.memory.fraction.
  104. Memory in Flink - network buffers. Example:
      Error: java.lang.Exception: Failed to deploy the task CHAIN Reduce(org.okkam.flink.maintenance.deduplication.blocking.RemoveDuplicateReduceGroupFunction) -> Combine(org.apache.flink.api.java.operators.DistinctOperator$DistinctFunction) (15/28) - execution #0 to slot SubSlot 5 (cab978f80c0cb7071136cd755e971be9 (5) - ALLOCATED/ALIVE): org.apache.flink.runtime.io.network.InsufficientResourcesException: okkam-nano-2.okkam.it has not enough buffers to safely execute CHAIN Reduce(...) -> Combine(...) (36 buffers missing)
      Memory is missing here, in the network buffers: increase taskmanager.network.numberOfBuffers; the managed memory will shrink automatically.
  105. What are these buffers needed for? [Diagram: a small Flink cluster with 4 processing slots on 2 TaskManagers, running a simple MapReduce job in Flink.]
  106. What are these buffers needed for? A MapReduce job with a parallelism of 2 and 2 processing slots per machine. [Diagram: each TaskManager holds a network buffer pool, with 8 buffers for outgoing data and 8 buffers for incoming data.]
  107. What are these buffers needed for? A MapReduce job with a parallelism of 2 and 2 processing slots per machine: each mapper has a logical connection to a reducer. [Diagram: the map-to-reduce connections across the two TaskManagers.]
  108. 7. How to tune your Apache Flink application (CPU, memory, I/O)? 7.3 Tuning I/O: specifying temporary directories for spilling.
  109. Disk I/O. Sometimes your data doesn't fit into main memory, so Flink has to spill to disk: taskmanager.tmp.dirs: /mnt/disk1,/mnt/disk2 Use real local disks only (no tmpfs or NAS). [Diagram: each TaskManager runs a reader thread and a writer thread per configured disk.]
  110. 7. How to tune your Apache Flink application? 7.4 Optimizer hints. Examples: DataSet.join(DataSet other, JoinHint.BROADCAST_HASH_SECOND) DataSet.join(DataSet other, JoinHint.BROADCAST_HASH_FIRST) http://stackoverflow.com/questions/31484856/the-difference-and-benefit-of-joinwithtiny-joinwithhuge-and-joinhint A join-hint sketch follows below.
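To show where the hint goes in a real program, a small sketch (the data sets are toy placeholders; the JoinHint import path is the one used by later 0.x/1.x releases):

      import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
      import org.apache.flink.api.java.DataSet;
      import org.apache.flink.api.java.ExecutionEnvironment;
      import org.apache.flink.api.java.tuple.Tuple2;

      public class JoinHintSketch {
        public static void main(String[] args) throws Exception {
          ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
          DataSet<Tuple2<Integer, String>> large = env.fromElements(
              new Tuple2<Integer, String>(1, "a"), new Tuple2<Integer, String>(2, "b"));
          DataSet<Tuple2<Integer, String>> small = env.fromElements(
              new Tuple2<Integer, String>(1, "x"));
          // Tell the optimizer that the second input is small enough to
          // broadcast to all nodes and build a hash table from.
          large.join(small, JoinHint.BROADCAST_HASH_SECOND)
               .where(0)   // key of the first input
               .equalTo(0) // key of the second input
               .print();
        }
      }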
  111. Consider attending Flink Forward, the first dedicated Apache Flink conference, on October 12-13, 2015 in Berlin, Germany! http://flink-forward.org/ Two parallel tracks: talks (presentations and use cases) and trainings (2 days of hands-on training workshops by the Flink committers).
