hbaseconasia2017: HBase on Beam

Jingcheng Du

Apache Beam is an open-source, unified programming model for defining batch and streaming jobs that run on many execution engines. HBase on Beam is a connector that allows Beam to use HBase as a bounded data source and as a target data store for both batch and streaming data sets. With this connector, HBase can work directly with many batch and streaming engines, for example Spark, Flink, and Google Cloud Dataflow. In this session, I will introduce Apache Beam, the current implementation of HBase on Beam, and the future plans for it.

hbaseconasia2017 hbasecon hbase
https://www.eventbrite.com/e/hbasecon-asia-2017-tickets-34935546159#

  1. HBase on Beam
  2. Apache Beam
     • Apache Beam is an open-source, unified programming model for defining both batch and streaming data-parallel processing pipelines.
     • It was initiated and contributed by Google.
     • The first stable release was published on May 17, 2017.
  3. Apache Beam: https://beam.apache.org/images/beam_architecture.png
  4. Apache Beam
     • A unified model for batch and streaming applications.
     • Runners for popular open-source batch and streaming engines, for instance Spark and Flink.
     • Multiple languages are available for end users to build their own pipelines; Java and Python are currently supported.
     • Implement once, run almost everywhere.
  5. Apache Beam
     • Pipeline: the processing pipeline, which includes data input, transforms, and output.
     • PCollection: the representation of both bounded and unbounded data.
     • Transform: ParDo, GroupByKey, Combine, Flatten, …
  6. Data Sources
     • In-memory data: Array, Collection, Map
     • Text
     • HDFS
     • Kafka
     • HBase
     • …
  7. Windowing
     • Fixed time windows
     • Sliding time windows
     • Session windows
     • Single global window
  8. Serialization
     • Every Transform must be serializable!
     • CustomCoder
     • Register coder for classes
     • Register coder for the output of transform
     • Serializable
  9. Example: Count the Words: https://beam.apache.org/images/wordcount-pipeline.png
  10. Examples: Count the Words
  11. Capability Matrix: https://beam.apache.org/documentation/runners/capability-matrix/
  12. HBase + Beam
     • Inspired by HBase + Spark.
     • Similar functions; Beam SQL is not supported yet.
     • Use HBase as a bounded data source and as a target data store in both batch and streaming applications.
     • Customized Transforms for HBase bulk operations, with HBasePipelineFunctions as the entry point to start the pipeline.
  13. Operations
     • Operations for both batch and streaming modes
     • Scan (already implemented in Beam)
     • BulkGet
     • BulkPut
     • BulkDelete
     • MapPartitions
     • ForeachPartition
     • BulkLoad
     • BulkLoadThinRows
  14. Examples: Scan
     • Read data from an HBase table by scan.
  15. Examples: BulkGet
     • Implement MakeFunctions to convert the input to Get, and to convert Result to the output.
  16. Examples: BulkPut
     • Implement MakeFunction to convert the input to Put.
  17. Examples: BulkDelete
     • Implement MakeFunction to convert the input to Delete.
  18. Examples: MapPartitions
  19. Examples: MapPartitions
  20. Examples: ForeachPartition
  21. Examples: BulkLoad
     • Implement MakeFunction to convert each input into a Cell.
  22. Examples: BulkLoad
  23. Example: BulkLoadThinRows
     • Implement MakeFunctions to convert each input into row keys and cells.
  24. Example: BulkLoadThinRows
  25. Future
     • Contribute the code to Apache Beam.
     • Support Beam SQL in HBase.
  26. Thank You!
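
The windowing strategies listed on slide 7 come down to simple timestamp arithmetic. The sketch below is a plain-Java model of that arithmetic, not Beam's windowing API: the class and method names are hypothetical, timestamps are bare long values in arbitrary units, and the session logic assumes its input is already sorted. The single global window needs no computation, since every element lands in the same window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of Apache Beam.
final class WindowingSketch {

    /** Start of the fixed window of length `size` that contains timestamp `ts`. */
    static long fixedWindowStart(long ts, long size) {
        // floorMod keeps the result correct for negative timestamps as well.
        return ts - Math.floorMod(ts, size);
    }

    /**
     * Starts of all sliding windows (length `size`, a new window every `period`)
     * that contain `ts`, latest window first.
     */
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long latestStart = ts - Math.floorMod(ts, period);
        // A window [start, start + size) contains ts while start > ts - size.
        for (long start = latestStart; start > ts - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    /**
     * Groups sorted timestamps into sessions. Each element opens a window
     * [ts, ts + gap); windows that overlap are merged into one session.
     * Returns {start, end} pairs with an exclusive end.
     */
    static List<long[]> sessionWindows(List<Long> sortedTs, long gap) {
        List<long[]> sessions = new ArrayList<>();
        for (long ts : sortedTs) {
            long[] last = sessions.isEmpty() ? null : sessions.get(sessions.size() - 1);
            if (last != null && ts < last[1]) {
                last[1] = ts + gap;                       // extend the open session
            } else {
                sessions.add(new long[] {ts, ts + gap});  // start a new session
            }
        }
        return sessions;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125, 60));
        System.out.println(slidingWindowStarts(125, 60, 30));
        for (long[] s : sessionWindows(Arrays.asList(0L, 5L, 30L), 10)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```

In an actual Beam pipeline a windowing transform performs this assignment; the model is only meant to show which windows a given timestamp falls into.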

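The word-count slides (9 and 10) survive in this transcript only as titles and an image link, so the code they showed is not reproduced here. As a rough stand-in, the sketch below models the usual stages of such a pipeline in plain Java: split lines into words (a ParDo-style step), count occurrences per word (grouping by key, then combining), and format the result. It does not use the Beam API, and the class name, method names, and tokenizing rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java model of word-count stages; not Beam code.
final class WordCountSketch {

    /** Stage 1: split each line into words, dropping empty tokens. */
    static List<String> extractWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            // Assumed tokenizer: any run of non-letter characters separates words.
            for (String token : line.split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    /** Stage 2: count occurrences per word (group by key, then sum). */
    static Map<String, Long> countPerWord(List<String> words) {
        Map<String, Long> counts = new TreeMap<>(); // sorted for stable output
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    /** Stage 3: format each entry as "word: count". */
    static List<String> format(Map<String, Long> counts) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            out.add(e.getKey() + ": " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        for (String row : format(countPerWord(extractWords(lines)))) {
            System.out.println(row);
        }
    }
}
```

Each stage takes a collection and returns a new one, which mirrors how a pipeline chains transforms over PCollections; in Beam the same stages would run distributed on whichever runner is selected.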