Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
hosted by
Scaling 30 TB’s of Data Lake
with Apache HBase and Scala
DSL at Production
Chetan Khatri
hosted by
Who Am I
Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Ap...
hosted by
Agenda 01
02
04
03
What is Apache HBase
Why Apache HBase
Apache Spark and Scala
Apache Spark HBase Connector
05 ...
hosted by
Add the title
•  Modules do not limit the size,
number, interval, can be adjusted
according to need
•  Modules d...
hosted by
What is Apache Spark ?
Source: https://spark.apache.org/
Structured Data / SQL -
Spark SQL
Graph Processing -
Gr...
hosted by
What is Scala
●  Scala is a modern multi-paradigm programming language
designed to express common programming pa...
hosted by
Case Story: Retail Analytics
Architecting Fast Data Processing Platform to Scale 30 TB of Data in Production
Use...
hosted by
Case Story: Retail Analytics - Scale
Challenges
²  Weekly Data refresh, Daily Data refresh batch Spark / Hadoop...
hosted by
Using HBase as a MDM System
MDM - Master Data Management
1. HBase Driven Data Deduplication Algorithms
Example,
...
hosted by
Using HBase as a MDM System
Data Deduplication Problem Retail Analytics !
Examples,
●  You may find Item with sa...
hosted by
HBase - NoSQL, Denormalized columnar schema model
For example,
Table: transactional_line_item
Column Family: f C...
hosted by
Case Story: Retail Analytics - Scale
-  5x performance improvements by re-engineering entire data lake to analy...
hosted by
Case Story: Retail Analytics
How ?
hosted by
Data Platform
- Infrastructure Architecture
Oracle
POS Files
Kafka
Spark
Streaming
HBase
Hive / Spark /
Presto
S...
hosted by
AsyncHBase Build.sbt with Akka HTTP
hosted by
Data Processing Infrastructure
●  MapR Distribution - http://archive.mapr.com/releases/
ecosystem-5.x/redhat/
● ...
hosted by
Rethink
- Fast Data Architectures. Unify, Simplify.
UNIFIED fast data processing engine that provides:
The
SCALE...
hosted by
Spark HBase Connector.
Credit: Contributors
https://github.com/nerdammer/spark-hbase-
connector
hosted by
Spark HBase Connector.
It’s a Spark package connector written on top of Java HBase API. A
simple and elegant way...
hosted by
Spark HBase Connector.
Dependency in build.sbt - libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-co...
hosted by
Writing to HBase
hosted by
Reading from HBase
hosted by
Filtering
hosted by
Manage Empty Columns with Option[T]
hosted by
Custom Mapping with Case Classes
hosted by
Custom Mapping with Case Classes ...
hosted by
Implicit Reader
hosted by
Implicit Reader ...
Do not forget to override the columns method.
hosted by
HBase Read table Data in Spark DataFrame
hosted by
HBase Implicit Field Writer
hosted by
Save DataFrame to HBase
hosted by
Reference
[1] Apache HBase – Apache HBase™ Home
URL: https://hbase.apache.org/
[2] Architecting HBase Applicatio...
hosted by
Reference
[10] Kafka Clients
URL: https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
[11] Spray J...
hosted by
Thanks
Upcoming SlideShare
Loading in …5
×

HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and Scala DSL in Production

326 views

Published on

Chetankumar Jyestaram Khattri

Published in: Internet
  • Be the first to comment

  • Be the first to like this

HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and Scala DSL in Production

  1. 1. hosted by Scaling 30 TB’s of Data Lake with Apache HBase and Scala DSL at Production Chetan Khatri
  2. 2. hosted by Who Am I Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
  3. 3. hosted by Agenda 01 02 04 03 What is Apache HBase Why Apache HBase Apache Spark and Scala Apache Spark HBase Connector 05 Case Study: Retail Analytics Architecturing Fast Data Processing Platform to Scale 30 TB Data in Production
  4. 4. hosted by Add the title •  Modules do not limit the size, number, interval, can be adjusted according to need •  Modules do not limit the size, number, interval, can be adjusted according to need •  Modules do not limit the size, number, interval, can be adjusted according to need Source: https://hbase.apache.org/ ●  Column-oriented NoSQL ●  Non-relational ●  Distributed database build on top of HDFS. ●  Modeled after Google’s BigTable. ●  Built for fault-tolerant application with billions/ trillions of rows and millions of columns. ●  Very low latency and near real-time random reads and random writes. ●  Replication, end-to-end checksums, automatic rebalancing with HDFS. ●  Compression ●  Bloom filters ●  MapReduce over HBase data. ●  Best at fetching rows by key, scanning ranges of rows with ordered partitioning. What is Apache HBase
  5. 5. hosted by What is Apache Spark ? Source: https://spark.apache.org/ Structured Data / SQL - Spark SQL Graph Processing - GraphX Machine Learning - MLlib Streaming - Spark Streaming, Structured Streaming
  6. 6. hosted by What is Scala ●  Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. ●  Scala is object-oriented ●  Scala is functional ●  Strongly typed, Type Inference ●  Higher Order Functions ●  Lazy Computation Source: www.scala-lang.org
  7. 7. hosted by Case Story: Retail Analytics Architecting Fast Data Processing Platform to Scale 30 TB of Data in Production Use cases in Retail Analytics: Business: explain the who, what, when, where, why and how they are doing Retailing. ●  What is selling as compared to what was being ordered. ●  Effective promotions - right promotions at right outlet and right time. ●  What types of Cigarette consumers are shopping in your outlets ? ○  Gives smoking patterns in specific geography, predict demand on supply. ●  What are the purchasing patterns of your consumers ? ○  are they purchasing Pizza and Ice cream together ? ○  are they purchasing multiple Instant food products with soda together ? ●  Time Series problem - year, month, day of year, week of year to Identify which brands are not getting sold at specific geography, so it can be swap to other store.
  8. 8. hosted by Case Story: Retail Analytics - Scale Challenges ²  Weekly Data refresh, Daily Data refresh batch Spark / Hadoop job execution failures with unutilized Spark Cluster. ²  Scalability of massive data: ○  ~4.6 Billion events on every weekly data refresh. ○  Processing historical data: one time bulk load ~30 TB of data / ~17 billion transactional records. ²  Linear / sequential execution mode with broken data pipelines. ²  Joining ~17 billion transactional records with skewed data nature. ²  Data deduplication - outlet, item at retail.
  9. 9. hosted by Using HBase as a MDM System MDM - Master Data Management 1. HBase Driven Data Deduplication Algorithms Example, ²  Outlet Matching ²  Item Matching ²  Address Matching ²  Brand Matching 2. Abbreviation Standardization Example, ²  UOM Standardization ²  Outlet Name, Address Standardization ²  UPC Standardization UOM Quantity PACK 2 2PACK NA 2PK
  10. 10. hosted by Using HBase as a MDM System Data Deduplication Problem Retail Analytics ! Examples, ●  You may find Item with same UPC code. ●  You may find Outlet with same Outlet number. ●  What if UPC Code gets upgraded from 10 Digits to 14 Digits. (Update everywhere, needs faster update.)
  11. 11. hosted by HBase - NoSQL, Denormalized columnar schema model For example, Table: transactional_line_item Column Family: f Column Family: o Column Family: t created_datetime file_id transaction_id manufacturer_operator_submitter_id Outlet_id Outlet_name Outlet_state Outlet_city Outlet_address1 Outlet_address2 Outlet_region Outlet_country Outlet_owner_name Outlet_status Outlet_zipcode Outlet_started_date outlet_sub_chain_name Transaction_id Item_id Promo_code Gross_price Discount Quantity Upc Uom State_gst Country_gst Delivery_charges Packing_charges Vendor_discount Partner_discount Corporate_discount reward_point_discount
  12. 12. hosted by Case Story: Retail Analytics - Scale -  5x performance improvements by re-engineering entire data lake to analytical engine pipeline’s. -  Proposed highly concurrent, elastic, non-blocking, asynchronous architecture to save customer’s ~22 hours runtime (~8 hours from 30 hours) for 4.6 Billion events. -  10x performance improvements on historical load by under the hood algorithms optimization on 17 Billion events (Benchmarks ~1 hour execution time only) -  Master Data Management (MDM) - Deduplication and fuzzy logic matching on retails data(Item, Outlet) improved elastic performance. -  Using HBase as a Master Data Management (MDM) System.
  13. 13. hosted by Case Story: Retail Analytics How ?
  14. 14. hosted by Data Platform - Infrastructure Architecture Oracle POS Files Kafka Spark Streaming HBase Hive / Spark / Presto Spark MLLib TensorFlow Elastic Search Akka-HTTP HDFS Real time query HBase Staging Hive Aggregation Hive Layer Read Rec by Rec and do match [2] maprcli Index and setup replication [1] Asynchronous HBase: https://github.com/OpenTSDB/asynchbase [2] Index MapR-DB Data into Elasticsearch https://community.mapr.com/community/exchange/blog/ 2016/12/12/how-to-index-mapr-db-data-into-elasticsearch-on-aws NodeJS [3] NodeJS HBase https://www.npmjs.com/package/ node-thrift2-hbase https://www.npmjs.com/package/ async https://www.npmjs.com/package/ thrift
  15. 15. hosted by AsyncHBase Build.sbt with Akka HTTP
  16. 16. hosted by Data Processing Infrastructure ●  MapR Distribution - http://archive.mapr.com/releases/ ecosystem-5.x/redhat/ ●  Data Lake - Apache HBase 0.98.12 ●  EDW / Analytical Data store - Apache Hive 1.2.1 ●  Unified Execution Engine - Apache Spark 2.0.1 ●  Distributed File storage - MapR-FS ●  Queueing mechanism - Apache Kafka ●  Streaming - HTTP Akka + scala 2.11 , Spark Streaming ●  Reporting Database - PostgreSQL 9.x ●  Legacy Database - Oracle 9 ●  Workflow Management tool - BMC Control M
  17. 17. hosted by Rethink - Fast Data Architectures. Unify, Simplify. UNIFIED fast data processing engine that provides: The SCALE of data lake The RELIABILITY & PERFORMANCE of data warehouse The LOW LATENCY of streaming
  18. 18. hosted by Spark HBase Connector. Credit: Contributors https://github.com/nerdammer/spark-hbase- connector
  19. 19. hosted by Spark HBase Connector. It’s a Spark package connector written on top of Java HBase API. A simple and elegant way to write Spark - HBase Jobs. Powerful Functional Scala DSL integrated for Apache Spark. Supports: ●  Scala > 2.10 ●  Spark > 1.6 ●  HBase > 1.0
  20. 20. hosted by Spark HBase Connector. Dependency in build.sbt - libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3" Setting the HBase Host
  21. 21. hosted by Writing to HBase
  22. 22. hosted by Reading from HBase
  23. 23. hosted by Filtering
  24. 24. hosted by Manage Empty Columns with Option[T]
  25. 25. hosted by Custom Mapping with Case Classes
  26. 26. hosted by Custom Mapping with Case Classes ...
  27. 27. hosted by Implicit Reader
  28. 28. hosted by Implicit Reader ... Do not forget to override the columns method.
  29. 29. hosted by HBase Read table Data in Spark DataFrame
  30. 30. hosted by HBase Implicit Field Writer
  31. 31. hosted by Save DataFrame to HBase
  32. 32. hosted by Reference [1] Apache HBase – Apache HBase™ Home URL: https://hbase.apache.org/ [2] Architecting HBase Applications: A Guidebook for successful development and design by Jean-Marc Spaggiari & Kevin O’Dell. [3] AsyncHBase URL: https://github.com/OpenTSDB/asynchbase [4] Spark HBase Connector URL: https://github.com/nerdammer/spark-hbase-connector [5] NodeJS Thirft2 HBase package URL: https://www.npmjs.com/package/node-thrift2-hbase [6] NodeJS Async URL: https://www.npmjs.com/package/async [7] Akka Actor URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-actor [8] Akka HTTP Core URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-http-core_2.11/10.0.1 [9] Akka HTTP Spray JSON URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-http-spray-json-experimental
  33. 33. hosted by Reference [10] Kafka Clients URL: https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients [11] Spray JSON URL: https://mvnrepository.com/artifact/io.spray/spray-json [12] Spray JSON Shapeless URL: https://mvnrepository.com/artifact/com.github.fommil/spray-json-shapeless [13] Scalamock scalatest URL: https://mvnrepository.com/artifact/org.scalamock/scalamock-scalatest-support
  34. 34. hosted by Thanks

×