SlideShare a Scribd company logo
hosted by
Scaling 30 TB’s of Data Lake
with Apache HBase and Scala
DSL at Production
Chetan Khatri
hosted by
Who Am I
Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase
Connectors.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
Advisor - Data Science Lab, University of Kachchh, India.
M.Sc. - Computer Science from University of Kachchh, India.
hosted by
Agenda 01
02
04
03
What is Apache HBase
Why Apache HBase
Apache Spark and Scala
Apache Spark HBase Connector
05 Case Study: Retail Analytics
Architecturing Fast Data Processing Platform to
Scale 30 TB Data in Production
hosted by
Add the title
•  Modules do not limit the size,
number, interval, can be adjusted
according to need
•  Modules do not limit the size,
number, interval, can be adjusted
according to need
•  Modules do not limit the size,
number, interval, can be adjusted
according to need
Source: https://hbase.apache.org/
●  Column-oriented NoSQL
●  Non-relational
●  Distributed database build on top of HDFS.
●  Modeled after Google’s BigTable.
●  Built for fault-tolerant application with billions/
trillions of rows and millions of columns.
●  Very low latency and near real-time random
reads and random writes.
●  Replication, end-to-end checksums,
automatic rebalancing with HDFS.
●  Compression
●  Bloom filters
●  MapReduce over HBase data.
●  Best at fetching rows by key, scanning
ranges of rows with ordered partitioning.
What is Apache HBase
hosted by
What is Apache Spark ?
Source: https://spark.apache.org/
Structured Data / SQL -
Spark SQL
Graph Processing -
GraphX
Machine Learning -
MLlib
Streaming - Spark Streaming,
Structured Streaming
hosted by
What is Scala
●  Scala is a modern multi-paradigm programming language
designed to express common programming patterns in a concise,
elegant, and type-safe way.
●  Scala is object-oriented
●  Scala is functional
●  Strongly typed, Type Inference
●  Higher Order Functions
●  Lazy Computation
Source: www.scala-lang.org
hosted by
Case Story: Retail Analytics
Architecting Fast Data Processing Platform to Scale 30 TB of Data in Production
Use cases in Retail Analytics:
Business: explain the who, what, when, where, why and how they are doing
Retailing.
●  What is selling as compared to what was being ordered.
●  Effective promotions - right promotions at right outlet and right time.
●  What types of Cigarette consumers are shopping in your outlets ?
○  Gives smoking patterns in specific geography, predict demand on supply.
●  What are the purchasing patterns of your consumers ?
○  are they purchasing Pizza and Ice cream together ?
○  are they purchasing multiple Instant food products with soda together ?
●  Time Series problem - year, month, day of year, week of year to Identify which
brands are not getting sold at specific geography, so it can be swap to other
store.
hosted by
Case Story: Retail Analytics - Scale
Challenges
²  Weekly Data refresh, Daily Data refresh batch Spark / Hadoop job
execution failures with unutilized Spark Cluster.
²  Scalability of massive data:
○  ~4.6 Billion events on every weekly data refresh.
○  Processing historical data: one time bulk load ~30 TB of data / ~17
billion transactional records.
²  Linear / sequential execution mode with broken data pipelines.
²  Joining ~17 billion transactional records with skewed data nature.
²  Data deduplication - outlet, item at retail.
hosted by
Using HBase as a MDM System
MDM - Master Data Management
1. HBase Driven Data Deduplication Algorithms
Example,
²  Outlet Matching
²  Item Matching
²  Address Matching
²  Brand Matching
2. Abbreviation Standardization
Example,
²  UOM Standardization
²  Outlet Name, Address Standardization
²  UPC Standardization
UOM Quantity
PACK 2
2PACK NA
2PK
hosted by
Using HBase as a MDM System
Data Deduplication Problem Retail Analytics !
Examples,
●  You may find Item with same UPC code.
●  You may find Outlet with same Outlet number.
●  What if UPC Code gets upgraded from 10 Digits to 14 Digits.
(Update everywhere, needs faster update.)
hosted by
HBase - NoSQL, Denormalized columnar schema model
For example,
Table: transactional_line_item
Column Family: f Column Family: o Column Family: t
created_datetime
file_id
transaction_id
manufacturer_operator_submitter_id
Outlet_id
Outlet_name
Outlet_state
Outlet_city
Outlet_address1
Outlet_address2
Outlet_region
Outlet_country
Outlet_owner_name
Outlet_status
Outlet_zipcode
Outlet_started_date
outlet_sub_chain_name
Transaction_id
Item_id
Promo_code
Gross_price
Discount
Quantity
Upc
Uom
State_gst
Country_gst
Delivery_charges
Packing_charges
Vendor_discount
Partner_discount
Corporate_discount
reward_point_discount
hosted by
Case Story: Retail Analytics - Scale
-  5x performance improvements by re-engineering entire data lake to analytical engine
pipeline’s.
-  Proposed highly concurrent, elastic, non-blocking, asynchronous architecture to save
customer’s ~22 hours runtime (~8 hours from 30 hours) for 4.6 Billion events.
-  10x performance improvements on historical load by under the hood algorithms
optimization on 17 Billion events (Benchmarks ~1 hour execution time only)
-  Master Data Management (MDM) - Deduplication and fuzzy logic matching on retails
data(Item, Outlet) improved elastic performance.
-  Using HBase as a Master Data Management (MDM) System.
hosted by
Case Story: Retail Analytics
How ?
hosted by
Data Platform
- Infrastructure Architecture
Oracle
POS Files
Kafka
Spark
Streaming
HBase
Hive / Spark /
Presto
Spark
MLLib
TensorFlow
Elastic
Search
Akka-HTTP
HDFS
Real time query
HBase Staging
Hive
Aggregation
Hive Layer
Read Rec by Rec
and do match
[2] maprcli Index
and setup
replication
[1] Asynchronous HBase:
https://github.com/OpenTSDB/asynchbase
[2] Index MapR-DB Data into Elasticsearch
https://community.mapr.com/community/exchange/blog/
2016/12/12/how-to-index-mapr-db-data-into-elasticsearch-on-aws
NodeJS
[3] NodeJS HBase
https://www.npmjs.com/package/
node-thrift2-hbase
https://www.npmjs.com/package/
async
https://www.npmjs.com/package/
thrift
hosted by
AsyncHBase Build.sbt with Akka HTTP
hosted by
Data Processing Infrastructure
●  MapR Distribution - http://archive.mapr.com/releases/
ecosystem-5.x/redhat/
●  Data Lake - Apache HBase 0.98.12
●  EDW / Analytical Data store - Apache Hive 1.2.1
●  Unified Execution Engine - Apache Spark 2.0.1
●  Distributed File storage - MapR-FS
●  Queueing mechanism - Apache Kafka
●  Streaming - HTTP Akka + scala 2.11 , Spark Streaming
●  Reporting Database - PostgreSQL 9.x
●  Legacy Database - Oracle 9
●  Workflow Management tool - BMC Control M
hosted by
Rethink
- Fast Data Architectures. Unify, Simplify.
UNIFIED fast data processing engine that provides:
The
SCALE
of data lake
The
RELIABILITY &
PERFORMANCE
of data warehouse
The
LOW LATENCY
of streaming
hosted by
Spark HBase Connector.
Credit: Contributors
https://github.com/nerdammer/spark-hbase-
connector
hosted by
Spark HBase Connector.
It’s a Spark package connector written on top of Java HBase API. A
simple and elegant way to write Spark - HBase Jobs. Powerful
Functional Scala DSL integrated for Apache Spark.
Supports:
●  Scala > 2.10
●  Spark > 1.6
●  HBase > 1.0
hosted by
Spark HBase Connector.
Dependency in build.sbt - libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3"
Setting the HBase Host
hosted by
Writing to HBase
hosted by
Reading from HBase
hosted by
Filtering
hosted by
Manage Empty Columns with Option[T]
hosted by
Custom Mapping with Case Classes
hosted by
Custom Mapping with Case Classes ...
hosted by
Implicit Reader
hosted by
Implicit Reader ...
Do not forget to override the columns method.
hosted by
HBase Read table Data in Spark DataFrame
hosted by
HBase Implicit Field Writer
hosted by
Save DataFrame to HBase
hosted by
Reference
[1] Apache HBase – Apache HBase™ Home
URL: https://hbase.apache.org/
[2] Architecting HBase Applications: A Guidebook for successful development and design by Jean-Marc
Spaggiari & Kevin O’Dell.
[3] AsyncHBase
URL: https://github.com/OpenTSDB/asynchbase
[4] Spark HBase Connector
URL: https://github.com/nerdammer/spark-hbase-connector
[5] NodeJS Thirft2 HBase package
URL: https://www.npmjs.com/package/node-thrift2-hbase
[6] NodeJS Async
URL: https://www.npmjs.com/package/async
[7] Akka Actor
URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-actor
[8] Akka HTTP Core
URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-http-core_2.11/10.0.1
[9] Akka HTTP Spray JSON
URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-http-spray-json-experimental
hosted by
Reference
[10] Kafka Clients
URL: https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
[11] Spray JSON
URL: https://mvnrepository.com/artifact/io.spray/spray-json
[12] Spray JSON Shapeless
URL: https://mvnrepository.com/artifact/com.github.fommil/spray-json-shapeless
[13] Scalamock scalatest
URL: https://mvnrepository.com/artifact/org.scalamock/scalamock-scalatest-support
hosted by
Thanks

More Related Content

What's hot

HBaseConAsia2018 Track3-5: HBase Practice at Lianjia
HBaseConAsia2018 Track3-5: HBase Practice at LianjiaHBaseConAsia2018 Track3-5: HBase Practice at Lianjia
HBaseConAsia2018 Track3-5: HBase Practice at Lianjia
Michael Stack
 
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC timeHBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
Michael Stack
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
Michael Stack
 
HBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at MeituanHBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at Meituan
Michael Stack
 
HBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiHBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at Xiaomi
Michael Stack
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
DataWorks Summit
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
DataWorks Summit
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
Cloudera, Inc.
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
Kunal Gupta
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
huguk
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
Cloudera, Inc.
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
Ryan Hennig
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Data Con LA
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
DataWorks Summit
 
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsightHBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
HBaseCon
 
Hadoop Networking at Datasift
Hadoop Networking at DatasiftHadoop Networking at Datasift
Hadoop Networking at Datasift
huguk
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
Paris Data Engineers !
 
RedisConf17 - Home Depot - Turbo charging existing applications with Redis
RedisConf17 - Home Depot - Turbo charging existing applications with RedisRedisConf17 - Home Depot - Turbo charging existing applications with Redis
RedisConf17 - Home Depot - Turbo charging existing applications with Redis
Redis Labs
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
Zhenxiao Luo
 

What's hot (20)

HBaseConAsia2018 Track3-5: HBase Practice at Lianjia
HBaseConAsia2018 Track3-5: HBase Practice at LianjiaHBaseConAsia2018 Track3-5: HBase Practice at Lianjia
HBaseConAsia2018 Track3-5: HBase Practice at Lianjia
 
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC timeHBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
HBaseConAsia2018 Track1-1: Use CCSMap to improve HBase YGC time
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
 
HBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at MeituanHBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at Meituan
 
HBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at XiaomiHBaseConAsia2018 Track1-3: HBase at Xiaomi
HBaseConAsia2018 Track1-3: HBase at Xiaomi
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
 
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsightHBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight
 
Hadoop Networking at Datasift
Hadoop Networking at DatasiftHadoop Networking at Datasift
Hadoop Networking at Datasift
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
 
RedisConf17 - Home Depot - Turbo charging existing applications with Redis
RedisConf17 - Home Depot - Turbo charging existing applications with RedisRedisConf17 - Home Depot - Turbo charging existing applications with Redis
RedisConf17 - Home Depot - Turbo charging existing applications with Redis
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 

Similar to HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and Scala DSL in Production

SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
SnappyData
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
Qubole
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
MLconf
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloud
Karan Singh
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Sandish3Certs
Sandish3CertsSandish3Certs
Sandish3Certs
Sandish Kumar H N
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
Simon Ambridge
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
Data streaming
Data streamingData streaming
Data streaming
Alberto Paro
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
betalab
 
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionApache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
João Gabriel Lima
 
Unlock the value of big data with the DX2000 from NEC
Unlock the value of big data with the DX2000 from NECUnlock the value of big data with the DX2000 from NEC
Unlock the value of big data with the DX2000 from NEC
Principled Technologies
 
Deploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data Platform
Deploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data PlatformDeploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data Platform
Deploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data Platform
Rackspace
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
Sneha Challa
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
DataWorks Summit
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
AgnihotriGhosh2
 

Similar to HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and Scala DSL in Production (20)

SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloud
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Sandish3Certs
Sandish3CertsSandish3Certs
Sandish3Certs
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Data streaming
Data streamingData streaming
Data streaming
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionApache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
 
Unlock the value of big data with the DX2000 from NEC
Unlock the value of big data with the DX2000 from NECUnlock the value of big data with the DX2000 from NEC
Unlock the value of big data with the DX2000 from NEC
 
Deploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data Platform
Deploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data PlatformDeploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data Platform
Deploy Apache Spark™ on Rackspace OnMetal™ for Cloud Big Data Platform
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 

More from Michael Stack

hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud
hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloudhbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud
hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud
Michael Stack
 
hbaseconasia2019 Recent work on HBase at Pinterest
hbaseconasia2019 Recent work on HBase at Pinteresthbaseconasia2019 Recent work on HBase at Pinterest
hbaseconasia2019 Recent work on HBase at Pinterest
Michael Stack
 
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltdhbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd
Michael Stack
 
hbaseconasia2019 HBase at Didi
hbaseconasia2019 HBase at Didihbaseconasia2019 HBase at Didi
hbaseconasia2019 HBase at Didi
Michael Stack
 
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...
Michael Stack
 
hbaseconasia2019 HBase at Tencent
hbaseconasia2019 HBase at Tencenthbaseconasia2019 HBase at Tencent
hbaseconasia2019 HBase at Tencent
Michael Stack
 
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
Michael Stack
 
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
Michael Stack
 
hbaseconasia2019 Pharos as a Pluggable Secondary Index Component
hbaseconasia2019 Pharos as a Pluggable Secondary Index Componenthbaseconasia2019 Pharos as a Pluggable Secondary Index Component
hbaseconasia2019 Pharos as a Pluggable Secondary Index Component
Michael Stack
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
Michael Stack
 
hbaseconasia2019 OpenTSDB at Xiaomi
hbaseconasia2019 OpenTSDB at Xiaomihbaseconasia2019 OpenTSDB at Xiaomi
hbaseconasia2019 OpenTSDB at Xiaomi
Michael Stack
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
Michael Stack
 
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBasehbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
Michael Stack
 
hbaseconasia2019 Distributed Bitmap Index Solution
hbaseconasia2019 Distributed Bitmap Index Solutionhbaseconasia2019 Distributed Bitmap Index Solution
hbaseconasia2019 Distributed Bitmap Index Solution
Michael Stack
 
hbaseconasia2019 HBase Bucket Cache on Persistent Memory
hbaseconasia2019 HBase Bucket Cache on Persistent Memoryhbaseconasia2019 HBase Bucket Cache on Persistent Memory
hbaseconasia2019 HBase Bucket Cache on Persistent Memory
Michael Stack
 
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACLhbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL
Michael Stack
 
hbaseconasia2019 BDS: A data synchronization platform for HBase
hbaseconasia2019 BDS: A data synchronization platform for HBasehbaseconasia2019 BDS: A data synchronization platform for HBase
hbaseconasia2019 BDS: A data synchronization platform for HBase
Michael Stack
 
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...
Michael Stack
 
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...
Michael Stack
 
HBaseConAsia2019 Keynote
HBaseConAsia2019 KeynoteHBaseConAsia2019 Keynote
HBaseConAsia2019 Keynote
Michael Stack
 

More from Michael Stack (20)

hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud
hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloudhbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud
hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud
 
hbaseconasia2019 Recent work on HBase at Pinterest
hbaseconasia2019 Recent work on HBase at Pinteresthbaseconasia2019 Recent work on HBase at Pinterest
hbaseconasia2019 Recent work on HBase at Pinterest
 
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltdhbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd
 
hbaseconasia2019 HBase at Didi
hbaseconasia2019 HBase at Didihbaseconasia2019 HBase at Didi
hbaseconasia2019 HBase at Didi
 
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev...
 
hbaseconasia2019 HBase at Tencent
hbaseconasia2019 HBase at Tencenthbaseconasia2019 HBase at Tencent
hbaseconasia2019 HBase at Tencent
 
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
 
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
 
hbaseconasia2019 Pharos as a Pluggable Secondary Index Component
hbaseconasia2019 Pharos as a Pluggable Secondary Index Componenthbaseconasia2019 Pharos as a Pluggable Secondary Index Component
hbaseconasia2019 Pharos as a Pluggable Secondary Index Component
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
 
hbaseconasia2019 OpenTSDB at Xiaomi
hbaseconasia2019 OpenTSDB at Xiaomihbaseconasia2019 OpenTSDB at Xiaomi
hbaseconasia2019 OpenTSDB at Xiaomi
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
 
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBasehbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
 
hbaseconasia2019 Distributed Bitmap Index Solution
hbaseconasia2019 Distributed Bitmap Index Solutionhbaseconasia2019 Distributed Bitmap Index Solution
hbaseconasia2019 Distributed Bitmap Index Solution
 
hbaseconasia2019 HBase Bucket Cache on Persistent Memory
hbaseconasia2019 HBase Bucket Cache on Persistent Memoryhbaseconasia2019 HBase Bucket Cache on Persistent Memory
hbaseconasia2019 HBase Bucket Cache on Persistent Memory
 
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACLhbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL
 
hbaseconasia2019 BDS: A data synchronization platform for HBase
hbaseconasia2019 BDS: A data synchronization platform for HBasehbaseconasia2019 BDS: A data synchronization platform for HBase
hbaseconasia2019 BDS: A data synchronization platform for HBase
 
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in...
 
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB...
 
HBaseConAsia2019 Keynote
HBaseConAsia2019 KeynoteHBaseConAsia2019 Keynote
HBaseConAsia2019 Keynote
 

Recently uploaded

Tarun Gaur On Data Breaches and Privacy Fears
Tarun Gaur On Data Breaches and Privacy FearsTarun Gaur On Data Breaches and Privacy Fears
Tarun Gaur On Data Breaches and Privacy Fears
Tarun Gaur
 
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
shamrisumri
 
Information Systems Auditing, Controls and Assurance , tanapat limsaiprom
Information Systems Auditing, Controls and Assurance , tanapat limsaipromInformation Systems Auditing, Controls and Assurance , tanapat limsaiprom
Information Systems Auditing, Controls and Assurance , tanapat limsaiprom
TanapatLimsaiprom1
 
Chennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai Available
Chennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai AvailableChennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai Available
Chennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai Available
shamrisumri
 
Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...
Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...
Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...
paridubey2024#G05
 
Network Layer and its protocols mod .pptx
Network Layer and its protocols mod .pptxNetwork Layer and its protocols mod .pptx
Network Layer and its protocols mod .pptx
cossykin19
 
How-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdf
How-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdfHow-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdf
How-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdf
Dolphin Data Lab
 
workbook and project U5 1ºsecundaria.pdf
workbook and project U5 1ºsecundaria.pdfworkbook and project U5 1ºsecundaria.pdf
workbook and project U5 1ºsecundaria.pdf
anya2024forgya
 
@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...
@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...
@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...
shamrisumri
 
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptxDraya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
ashishkumarrana9
 
2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage
Zsolt Nemeth
 
Iot-Internet-of-Things_Industrial revolution 4.0-ppt.pptx
Iot-Internet-of-Things_Industrial revolution 4.0-ppt.pptxIot-Internet-of-Things_Industrial revolution 4.0-ppt.pptx
Iot-Internet-of-Things_Industrial revolution 4.0-ppt.pptx
DeepakKumar862274
 
Effective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptxEffective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptx
AirtoryInc
 
Girls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in City
dilbaagsingh0898
 
Jarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirts
Jarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirtsJarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirts
Jarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirts
exgf28
 
Portugal Dreamin 24 - How to easily use an API with Flows
Portugal Dreamin 24  - How to easily use an API with FlowsPortugal Dreamin 24  - How to easily use an API with Flows
Portugal Dreamin 24 - How to easily use an API with Flows
Thierry TROUIN ☁
 
AWS Networking Basic , tanapat limsaiprom
AWS Networking Basic , tanapat limsaipromAWS Networking Basic , tanapat limsaiprom
AWS Networking Basic , tanapat limsaiprom
ธนาพัฒน์ ลิ้มสายพรหม
 
6 Reasons to Use a VPN | 3S VPN Server App
6 Reasons to Use a VPN | 3S VPN Server App6 Reasons to Use a VPN | 3S VPN Server App
6 Reasons to Use a VPN | 3S VPN Server App
VPN Server
 
Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...
Edward Blurock
 
Quiz Quiz Hota Hai (School Quiz 2018-19)
Quiz Quiz Hota Hai (School Quiz 2018-19)Quiz Quiz Hota Hai (School Quiz 2018-19)
Quiz Quiz Hota Hai (School Quiz 2018-19)
Kashyap J
 

Recently uploaded (20)

Tarun Gaur On Data Breaches and Privacy Fears
Tarun Gaur On Data Breaches and Privacy FearsTarun Gaur On Data Breaches and Privacy Fears
Tarun Gaur On Data Breaches and Privacy Fears
 
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
High Profile Girls Call ServiCe Chennai XX00XXX00X Tanisha Best High Class Ch...
 
Information Systems Auditing, Controls and Assurance , tanapat limsaiprom
Information Systems Auditing, Controls and Assurance , tanapat limsaipromInformation Systems Auditing, Controls and Assurance , tanapat limsaiprom
Information Systems Auditing, Controls and Assurance , tanapat limsaiprom
 
Chennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai Available
Chennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai AvailableChennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai Available
Chennai Girls Call ServiCe X00XXX00XX Tanisha Best High Class Chennai Available
 
Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...
Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...
Kolkata @Girls @Call WhatsApp Numbers 🫦0000XX0000🫦 List For Friendship Girls ...
 
Network Layer and its protocols mod .pptx
Network Layer and its protocols mod .pptxNetwork Layer and its protocols mod .pptx
Network Layer and its protocols mod .pptx
 
How-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdf
How-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdfHow-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdf
How-to-Diagnose-Hard-Drives-by-DFL-DDP-2024.pdf
 
workbook and project U5 1ºsecundaria.pdf
workbook and project U5 1ºsecundaria.pdfworkbook and project U5 1ºsecundaria.pdf
workbook and project U5 1ºsecundaria.pdf
 
@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...
@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...
@Girls @Call Chennai 🛬 XXXXXXXXXX 🛬 available 24*7 cash payment book now pay ...
 
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptxDraya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
Draya Michele’s Son – Kniko Howard’s Rise to Fame.pptx
 
2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage2023. Archive - Gigabajtos selfpublisher homepage
2023. Archive - Gigabajtos selfpublisher homepage
 
Iot-Internet-of-Things_Industrial revolution 4.0-ppt.pptx
Iot-Internet-of-Things_Industrial revolution 4.0-ppt.pptxIot-Internet-of-Things_Industrial revolution 4.0-ppt.pptx
Iot-Internet-of-Things_Industrial revolution 4.0-ppt.pptx
 
Effective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptxEffective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptx
 
Girls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Shimla 000XX00000 Provide Best And Top Girl Service And No1 in City
 
Jarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirts
Jarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirtsJarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirts
Jarren Duran Fuck EM T shirts Jarren Duran Fuck EM T shirts
 
Portugal Dreamin 24 - How to easily use an API with Flows
Portugal Dreamin 24  - How to easily use an API with FlowsPortugal Dreamin 24  - How to easily use an API with Flows
Portugal Dreamin 24 - How to easily use an API with Flows
 
AWS Networking Basic , tanapat limsaiprom
AWS Networking Basic , tanapat limsaipromAWS Networking Basic , tanapat limsaiprom
AWS Networking Basic , tanapat limsaiprom
 
6 Reasons to Use a VPN | 3S VPN Server App
6 Reasons to Use a VPN | 3S VPN Server App6 Reasons to Use a VPN | 3S VPN Server App
6 Reasons to Use a VPN | 3S VPN Server App
 
Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...Ontology for the semantic enhancement, database definition and management and...
Ontology for the semantic enhancement, database definition and management and...
 
Quiz Quiz Hota Hai (School Quiz 2018-19)
Quiz Quiz Hota Hai (School Quiz 2018-19)Quiz Quiz Hota Hai (School Quiz 2018-19)
Quiz Quiz Hota Hai (School Quiz 2018-19)
 

HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and Scala DSL in Production

  • 1. hosted by Scaling 30 TB’s of Data Lake with Apache HBase and Scala DSL at Production Chetan Khatri
  • 2. hosted by Who Am I Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
  • 3. hosted by Agenda 01 02 04 03 What is Apache HBase Why Apache HBase Apache Spark and Scala Apache Spark HBase Connector 05 Case Study: Retail Analytics Architecturing Fast Data Processing Platform to Scale 30 TB Data in Production
  • 4. hosted by Add the title •  Modules do not limit the size, number, interval, can be adjusted according to need •  Modules do not limit the size, number, interval, can be adjusted according to need •  Modules do not limit the size, number, interval, can be adjusted according to need Source: https://hbase.apache.org/ ●  Column-oriented NoSQL ●  Non-relational ●  Distributed database build on top of HDFS. ●  Modeled after Google’s BigTable. ●  Built for fault-tolerant application with billions/ trillions of rows and millions of columns. ●  Very low latency and near real-time random reads and random writes. ●  Replication, end-to-end checksums, automatic rebalancing with HDFS. ●  Compression ●  Bloom filters ●  MapReduce over HBase data. ●  Best at fetching rows by key, scanning ranges of rows with ordered partitioning. What is Apache HBase
  • 5. hosted by What is Apache Spark ? Source: https://spark.apache.org/ Structured Data / SQL - Spark SQL Graph Processing - GraphX Machine Learning - MLlib Streaming - Spark Streaming, Structured Streaming
  • 6. hosted by What is Scala ●  Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. ●  Scala is object-oriented ●  Scala is functional ●  Strongly typed, Type Inference ●  Higher Order Functions ●  Lazy Computation Source: www.scala-lang.org
  • 7. hosted by Case Story: Retail Analytics Architecting Fast Data Processing Platform to Scale 30 TB of Data in Production Use cases in Retail Analytics: Business: explain the who, what, when, where, why and how they are doing Retailing. ●  What is selling as compared to what was being ordered. ●  Effective promotions - right promotions at right outlet and right time. ●  What types of Cigarette consumers are shopping in your outlets ? ○  Gives smoking patterns in specific geography, predict demand on supply. ●  What are the purchasing patterns of your consumers ? ○  are they purchasing Pizza and Ice cream together ? ○  are they purchasing multiple Instant food products with soda together ? ●  Time Series problem - year, month, day of year, week of year to Identify which brands are not getting sold at specific geography, so it can be swap to other store.
  • 8. hosted by Case Story: Retail Analytics - Scale Challenges ²  Weekly Data refresh, Daily Data refresh batch Spark / Hadoop job execution failures with unutilized Spark Cluster. ²  Scalability of massive data: ○  ~4.6 Billion events on every weekly data refresh. ○  Processing historical data: one time bulk load ~30 TB of data / ~17 billion transactional records. ²  Linear / sequential execution mode with broken data pipelines. ²  Joining ~17 billion transactional records with skewed data nature. ²  Data deduplication - outlet, item at retail.
  • 9. hosted by Using HBase as a MDM System MDM - Master Data Management 1. HBase Driven Data Deduplication Algorithms Example, ²  Outlet Matching ²  Item Matching ²  Address Matching ²  Brand Matching 2. Abbreviation Standardization Example, ²  UOM Standardization ²  Outlet Name, Address Standardization ²  UPC Standardization UOM Quantity PACK 2 2PACK NA 2PK
  • 10. hosted by Using HBase as a MDM System Data Deduplication Problem Retail Analytics ! Examples, ●  You may find Item with same UPC code. ●  You may find Outlet with same Outlet number. ●  What if UPC Code gets upgraded from 10 Digits to 14 Digits. (Update everywhere, needs faster update.)
  • 11. hosted by HBase - NoSQL, Denormalized columnar schema model For example, Table: transactional_line_item Column Family: f Column Family: o Column Family: t created_datetime file_id transaction_id manufacturer_operator_submitter_id Outlet_id Outlet_name Outlet_state Outlet_city Outlet_address1 Outlet_address2 Outlet_region Outlet_country Outlet_owner_name Outlet_status Outlet_zipcode Outlet_started_date outlet_sub_chain_name Transaction_id Item_id Promo_code Gross_price Discount Quantity Upc Uom State_gst Country_gst Delivery_charges Packing_charges Vendor_discount Partner_discount Corporate_discount reward_point_discount
  • 12. hosted by Case Story: Retail Analytics - Scale -  5x performance improvements by re-engineering entire data lake to analytical engine pipeline’s. -  Proposed highly concurrent, elastic, non-blocking, asynchronous architecture to save customer’s ~22 hours runtime (~8 hours from 30 hours) for 4.6 Billion events. -  10x performance improvements on historical load by under the hood algorithms optimization on 17 Billion events (Benchmarks ~1 hour execution time only) -  Master Data Management (MDM) - Deduplication and fuzzy logic matching on retails data(Item, Outlet) improved elastic performance. -  Using HBase as a Master Data Management (MDM) System.
  • 13. hosted by Case Story: Retail Analytics How ?
  • 14. hosted by Data Platform - Infrastructure Architecture Oracle POS Files Kafka Spark Streaming HBase Hive / Spark / Presto Spark MLLib TensorFlow Elastic Search Akka-HTTP HDFS Real time query HBase Staging Hive Aggregation Hive Layer Read Rec by Rec and do match [2] maprcli Index and setup replication [1] Asynchronous HBase: https://github.com/OpenTSDB/asynchbase [2] Index MapR-DB Data into Elasticsearch https://community.mapr.com/community/exchange/blog/ 2016/12/12/how-to-index-mapr-db-data-into-elasticsearch-on-aws NodeJS [3] NodeJS HBase https://www.npmjs.com/package/ node-thrift2-hbase https://www.npmjs.com/package/ async https://www.npmjs.com/package/ thrift
  • 16. hosted by Data Processing Infrastructure ●  MapR Distribution - http://archive.mapr.com/releases/ ecosystem-5.x/redhat/ ●  Data Lake - Apache HBase 0.98.12 ●  EDW / Analytical Data store - Apache Hive 1.2.1 ●  Unified Execution Engine - Apache Spark 2.0.1 ●  Distributed File storage - MapR-FS ●  Queueing mechanism - Apache Kafka ●  Streaming - HTTP Akka + scala 2.11 , Spark Streaming ●  Reporting Database - PostgreSQL 9.x ●  Legacy Database - Oracle 9 ●  Workflow Management tool - BMC Control M
  • 17. hosted by Rethink - Fast Data Architectures. Unify, Simplify. UNIFIED fast data processing engine that provides: The SCALE of data lake The RELIABILITY & PERFORMANCE of data warehouse The LOW LATENCY of streaming
  • 18. hosted by Spark HBase Connector. Credit: Contributors https://github.com/nerdammer/spark-hbase- connector
  • 19. hosted by Spark HBase Connector. It’s a Spark package connector written on top of Java HBase API. A simple and elegant way to write Spark - HBase Jobs. Powerful Functional Scala DSL integrated for Apache Spark. Supports: ●  Scala > 2.10 ●  Spark > 1.6 ●  HBase > 1.0
  • 20. hosted by Spark HBase Connector. Dependency in build.sbt - libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3" Setting the HBase Host
  • 24. hosted by Manage Empty Columns with Option[T]
  • 25. hosted by Custom Mapping with Case Classes
  • 26. hosted by Custom Mapping with Case Classes ...
  • 28. hosted by Implicit Reader ... Do not forget to override the columns method.
  • 29. hosted by HBase Read table Data in Spark DataFrame
  • 30. hosted by HBase Implicit Field Writer
  • 32. hosted by Reference [1] Apache HBase – Apache HBase™ Home URL: https://hbase.apache.org/ [2] Architecting HBase Applications: A Guidebook for successful development and design by Jean-Marc Spaggiari & Kevin O’Dell. [3] AsyncHBase URL: https://github.com/OpenTSDB/asynchbase [4] Spark HBase Connector URL: https://github.com/nerdammer/spark-hbase-connector [5] NodeJS Thirft2 HBase package URL: https://www.npmjs.com/package/node-thrift2-hbase [6] NodeJS Async URL: https://www.npmjs.com/package/async [7] Akka Actor URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-actor [8] Akka HTTP Core URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-http-core_2.11/10.0.1 [9] Akka HTTP Spray JSON URL: https://mvnrepository.com/artifact/com.typesafe.akka/akka-http-spray-json-experimental
  • 33. hosted by Reference [10] Kafka Clients URL: https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients [11] Spray JSON URL: https://mvnrepository.com/artifact/io.spray/spray-json [12] Spray JSON Shapeless URL: https://mvnrepository.com/artifact/com.github.fommil/spray-json-shapeless [13] Scalamock scalatest URL: https://mvnrepository.com/artifact/org.scalamock/scalamock-scalatest-support