Apache Beam @ GCPUG.TW Flink.TW 20161006

Introduction to Apache Beam
A dive into Beam's architecture, with a live demo of running a data pipeline on different runners such as Google Cloud Dataflow, Flink, and Spark

Published in: Software

  1. 1. Apache Beam in Data Pipeline • Randy Huang • 2016/10/06
  2. 2. Who am I • Data Architect @ VMFive • Fluentd/Embulk fan
  3. 3. Overview • Define Data Pipeline • Architecture • How to write Beam • Demo
  4. 4. Data Pipeline • Input • Algorithm • Output
  5. 5. Why Apache Beam?
  6. 6. The data pipeline world is chaotic
  7. 7. Goal • Provide an abstraction layer between data processing code and the execution runtime. • Bring batch processing and streaming jobs into one world. • The Beam SDK opens the door to write once, run anywhere* (*including on-premises and non-Google clouds)
  8. 8. Supported Runners • Google Cloud Dataflow (Blocking/Non-Blocking) • Apache Flink 1.1.2 • Apache Spark 1.6.2 (with Hadoop 2.2.0, Kafka 0.8.2.1)
  9. 9. API, model, and engine
  10. 10. Architecture • Pipelines • Translators • Runners
  11. 11. Programming tips / Flink • Use the Flink DataStream API in Java and Scala • Or use the Beam API directly in Java (and soon Python) with the Flink runner
  12. 12. SDK • Four parts: • Pipeline: streaming & batch processing • PCollection • Transform • I/O: source & sink
  13. 13. For Flink users • Users are encouraged to use either the Beam or the Flink API to implement their Flink jobs for stream data processing. • But the native Flink API offers: • a backwards-compatible API • built-in libraries (e.g., CEP and upcoming SQL) • key-value state (with the ability to query that state in the future) • Source: http://data-artisans.com/why-apache-beam/
  14. 14. Demo • GDELT project • EventCount by Location Pipeline
  15. 15. Recap • Write a general data pipeline, then choose your runner
  16. 16. Next… • New runners and SDKs (Python still in development) • DSL
  17. 17. Other things • BigQuery has DML support! https://goo.gl/lcZQVZ • Data Studio Beta is available in Taiwan • Embulk • Fluentd v0.14.6 (2016/09/07)
  18. 18. forward secure
  19. 19. remember to set up nginx
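
The recap on slide 15 (write the pipeline once, then choose a runner) can be sketched in plain Python. This is only a toy model of the idea, not the Beam SDK: `ToyPipeline`, `EagerRunner`, `LazyRunner`, and the sample events are hypothetical names and data invented for illustration, loosely echoing the EventCount by Location demo.

```python
from collections import Counter
from typing import Callable, Iterable, List

# In this toy model a "transform" is a function from an iterable to an iterable.
Transform = Callable[[Iterable], Iterable]


class ToyPipeline:
    """Records transforms without running them (hypothetical, not Beam's API)."""

    def __init__(self) -> None:
        self.transforms: List[Transform] = []

    def apply(self, transform: Transform) -> "ToyPipeline":
        self.transforms.append(transform)
        return self  # returning self allows chained apply() calls


def extract_location(events: Iterable[dict]) -> Iterable[str]:
    """Element-wise step: keep only the location field of each event."""
    for event in events:
        yield event["location"]


def count_per_key(keys: Iterable[str]) -> Iterable[tuple]:
    """Grouping step: consumes all keys, returns sorted (key, count) pairs."""
    return sorted(Counter(keys).items())


class EagerRunner:
    """Materializes a full list after every step."""

    def run(self, pipeline: ToyPipeline, source: Iterable) -> list:
        data = list(source)
        for transform in pipeline.transforms:
            data = list(transform(data))
        return data


class LazyRunner:
    """Chains the steps as iterators and collects the result only at the end."""

    def run(self, pipeline: ToyPipeline, source: Iterable) -> list:
        stream = iter(source)
        for transform in pipeline.transforms:
            stream = transform(stream)
        return list(stream)


# Made-up sample events; the real demo used GDELT data.
events = [{"location": "TW"}, {"location": "US"}, {"location": "TW"}]

# The pipeline is described once...
pipeline = ToyPipeline().apply(extract_location).apply(count_per_key)

# ...and executed by whichever runner is chosen.
print(EagerRunner().run(pipeline, events))  # [('TW', 2), ('US', 1)]
print(LazyRunner().run(pipeline, events))   # [('TW', 2), ('US', 1)]
```

The pipeline definition never mentions how it is executed, so swapping the runner requires no change to the transforms. That separation is the point of the abstraction layer named on slide 7, though real runners also translate the pipeline into their own execution model.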
