
Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016


Session at DataCamp Salzburg: https://www.meetup.com/Salzburg-Big-Data-Meetup/events/231844168/


  1. Introduction to Apache Beam (incubating). Sergio Fernández, Redlink GmbH. December 7, 2016, DataCamp Salzburg
  2. THE decision: picking from the crowded stream-processing landscape (Gearpump, Google Dataflow, ...). http://thenewstack.io/apache-streaming-projects-exploratory-guide/ https://twitter.com/ianhellstrom/status/710917506412716033 https://www.flickr.com/photos/somewhatfrank/7152104387/
  3. Apache Beam is a unified, engine-agnostic programming model for batch and streaming, designed to provide efficient and portable data processing pipelines
  4. Some bits of history...
  5. Beam http://bitmin.net/blog/what-is-google-cloud-dataflow/
  6. Beam Programming Model: abstract stack. Layers, top to bottom: SDKs / DSLs → Beam Pipeline Construction → Runners (Beam Fn) → Execution
  7. Beam Programming Model: concrete stack. SDKs / DSLs: Java SDK, Python SDK, x SDK, scio. Beam Pipeline Construction. Runners (Beam Fn): Direct Runner, Flink Runner, Spark Runner, Apex Runner, Dataflow Runner. Execution 1 ... Execution N
  8. Beam Capability Matrix https://beam.incubator.apache.org/documentation/runners/capability-matrix/
  9. Beam Model API in a nutshell
     ● Pipeline: a data processing job as a directed graph of steps
     ● PCollection: a parallel collection of timestamped elements that are in windows
     ● IO: produce/consume PCollections from/to outside the pipeline
     ● Transforms, for instance:
       ○ ParDo: flatmap over elements of a PCollection
       ○ (Co)GroupByKey: shuffle & group {K: V} pairs → {K: [V]}
       ○ Side inputs: global view of a PCollection used for broadcast / joins
     https://beam.apache.org/documentation/programming-guide/
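To illustrate the (Co)GroupByKey contract on slide 9, here is a plain-Java sketch (deliberately not the Beam API, which runs distributed and windowed) of the shuffle that turns a collection of {K: V} pairs into {K: [V]} groups:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java illustration of Beam's GroupByKey semantics: collect all
// values sharing a key into one group. In Beam this is a distributed
// shuffle over a PCollection<KV<K, V>>; here it is an in-memory analogy.
public class GroupByKeyDemo {

    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("cat", 1), Map.entry("dog", 5),
                Map.entry("cat", 9), Map.entry("dog", 2));
        // cat -> [1, 9], dog -> [5, 2]
        System.out.println(groupByKey(pairs));
    }
}
```

Within each group the encounter order of values is preserved here; Beam itself makes no ordering guarantee after a shuffle.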
  10. Writing a basic Beam Pipeline
      Options options = PipelineOptionsFactory.fromArgs(args)
          .withValidation().as(Options.class);
      Pipeline pipeline = Pipeline.create(options);
      pipeline.apply("ReadLines", TextIO.Read.from(options.getInput()))
          .apply(new CountWords())
          .apply(MapElements.via(new FormatAsTextFn()))
          .apply("WriteCounts", TextIO.Write.to(options.getOutput()));
      pipeline.run();
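The bodies of CountWords and FormatAsTextFn are not shown on the slide; this pipeline follows the canonical Beam WordCount example. As a rough, dependency-free sketch of what it computes (read lines, flatmap into words, count per word, format as text):

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java equivalent of the WordCount pipeline's logic, assuming the
// standard Beam example: CountWords ~ ParDo(split) + Count.perElement(),
// FormatAsTextFn ~ "word: count". No Beam dependency; in-memory only.
public class WordCountSketch {

    static List<String> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}']+"))) // split lines into words
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w,
                        TreeMap::new, Collectors.counting()))              // count occurrences per word
                .entrySet().stream()
                .map(e -> e.getKey() + ": " + e.getValue())                // format each count as text
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        countWords(List.of("hello world", "hello beam"))
                .forEach(System.out::println);
        // beam: 1
        // hello: 2
        // world: 1
    }
}
```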
  11. Run your Pipeline: Direct Runner
      mvn compile exec:java \
          -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
          -Dexec.args="--inputFile=../input.txt --output=target/direct/counts" \
          -Pdirect-runner
      http://beam.incubator.apache.org/get-started/quickstart/
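The -Pdirect-runner flag activates a Maven profile that pulls in the Direct Runner only when requested, keeping the runner out of the default dependency tree. A sketch of such a profile, along the lines of the Beam quickstart pom.xml (the `beam.version` property is assumed to be defined elsewhere in the pom):

```xml
<!-- Sketch of a runner profile as in the Beam quickstart archetype. -->
<profile>
  <id>direct-runner</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-direct-java</artifactId>
      <version>${beam.version}</version>
      <scope>runtime</scope>
    </dependency>
  </dependencies>
</profile>
```

The spark-runner and flink-runner profiles used on the next slides work the same way, swapping in the matching runner artifact.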
  12. Run your Pipeline: Spark
      mvn compile exec:java \
          -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
          -Dexec.args="--runner=SparkRunner --inputFile=input.txt --output=target/spark/counts" \
          -Pspark-runner
      http://beam.incubator.apache.org/get-started/quickstart/#runner-spark
  13. Run your Pipeline: Flink
      mvn package exec:java \
          -Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
          -Dexec.args="--runner=FlinkRunner --inputFile=input.txt --output=target/flink/counts" \
          -Pflink-runner
      http://beam.incubator.apache.org/get-started/quickstart/#runner-flink
  14. Thank you (Vielen Dank). Sergio Fernández, Software Engineer, https://www.wikier.org/ — Redlink GmbH, http://redlink.co. Work partially funded by SSIX, a European Union Horizon 2020 project (grant agreement no. 645425)
