Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016


Session at DataCamp Salzburg:

Published in: Engineering

  1. Introduction to Apache Beam (incubating). Sergio Fernández, Redlink GmbH. December 7, 2016, DataCamp Salzburg
  2. THE decision Gearpump Google Dataflow streams
  3. Apache Beam is a unified and agnostic (batch + stream) programming model designed to provide efficient and portable data processing pipelines.
  4. Some bits of history...
  5. Beam
  6. Beam Programming Model: abstract stack (diagram labels: SDK, DSL, Beam Pipeline Construction, Runner, Beam Fn Runners, Execution)
  7. Beam Programming Model: concrete stack (diagram labels: Java SDK, scio, Python SDK, x SDK, Beam Pipeline Construction, Flink Runner, Apex Runner, Dataflow Runner, Spark Runner, Direct Runner, Beam Fn Runners, Execution 1 ... Execution N)
  8. Beam Capability Matrix
  9. Beam Model API in a nutshell
     ● Pipeline: a data processing job as a directed graph of steps
     ● PCollection: a parallel collection of timestamped elements that are in windows
     ● IO: produce/consume PCollections from/to outside the pipeline
     ● Transforms, for instance:
       ○ ParDo: flatmap over elements of a PCollection
       ○ (Co)GroupByKey: shuffle & group {{K: V}} → {K: [V]}
       ○ Side inputs: global view of a PCollection used for broadcast / joins
  10. Writing a basic Beam Pipeline
      Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
      Pipeline pipeline = Pipeline.create(options);
      pipeline.apply("ReadLines", TextIO.Read.from(options.getInput()))
              .apply(new CountWords())
              .apply(MapElements.via(new FormatAsTextFn()))
              .apply("WriteCounts", ...);
  11. Run your Pipeline: Direct Runner
      mvn compile exec:java -Dexec.mainClass=io.redlink.datacamp.beam.WordCount -Dexec.args="--inputFile=../input.txt --output=target/direct/counts" -Pdirect-runner
  12. Run your Pipeline: Spark
      mvn compile exec:java -Dexec.mainClass=io.redlink.datacamp.beam.WordCount -Dexec.args="--runner=SparkRunner --inputFile=input.txt --output=target/spark/counts" -Pspark-runner
  13. Run your Pipeline: Flink
      mvn package exec:java -Dexec.mainClass=io.redlink.datacamp.beam.WordCount -Dexec.args="--runner=FlinkRunner --inputFile=input.txt --output=target/flink/counts" -Pflink-runner
  14. Thank you. Sergio Fernández, Software Engineer, Redlink GmbH. Work partially funded by SSIX, a European Union Horizon 2020 project (grant agreement no. 645425)
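
The transforms named on slide 9 (ParDo as a flatmap, GroupByKey as shuffle & group) can be illustrated without a Beam dependency. The sketch below is a plain-Java analogy of the word count from slide 10, not the Beam API: the class `WordCountSketch` and its methods `extractWords`, `groupByWord`, and `countWords` are hypothetical names chosen for illustration, and everything runs in memory on ordinary collections rather than on a PCollection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy of a word-count pipeline. This is NOT the Beam API;
// all names here are hypothetical and exist only for illustration.
class WordCountSketch {

    // ParDo analogue: each input line emits zero or more words (a flatmap).
    static List<String> extractWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // GroupByKey analogue: (word, 1) pairs are grouped into word -> [1, 1, ...].
    static Map<String, List<Long>> groupByWord(List<String> words) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, key -> new ArrayList<>()).add(1L);
        }
        return grouped;
    }

    // Combine step: sum the grouped values for every key.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        groupByWord(extractWords(lines)).forEach((word, ones) ->
                counts.put(word, ones.stream().mapToLong(Long::longValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "", "be quick");
        // Print one "word: count" line per distinct word, in sorted order.
        countWords(lines).forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

In a real pipeline the runner distributes each of these steps across workers; the in-memory version only shows what shape the data takes between steps.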