Design Patterns for 360º Views using HBase and Kiji
Jonathan Natkins
Who am I?
Jon “Natty” Natkins
Field Engineer at WibiData
Formerly at Cloudera/Vertica
What is a 360º View?
What is a 360º View For?
Past
What interactions has a customer had in the past?
Present
What is the customer doing right now?
Future
What is the customer likely to do next?
Past and present inform the future
What If I Don’t Care About Customers?
Generalizing the 360º View: Entity-Centric Systems
Goal of an Entity-Centric System
“Show me everything I know about Natty”
What Data Do I Need to Store?
Static data
Event-oriented data
Derived data
Building Entity-Centric Systems
Often, this is an EDW with a star schema
[Figure: star schema with one fact table (Fact) connected to four dimension tables (Dim)]
Challenges With Star Schemas
How do we answer the original question?
Full table scan + joins
OLTP systems will likely fall over from the volume
OLAP systems are usually not optimized for single-row lookups
Need Something Else…
Why HBase?
HBase rows can store both static and event-oriented data
Cell versions are key
Single-row lookups are extremely fast
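To make the point about cell versions concrete, here is a minimal Python sketch of one row that holds static and event-oriented data side by side. It is a toy in-memory model, not the HBase client API; the `Row` class, its methods, and the column names are assumptions chosen for illustration.

```python
class Row:
    """Toy model of one wide row: column name -> {timestamp: value}."""

    def __init__(self):
        self.cells = {}

    def put(self, column, timestamp, value):
        # Writing with a new timestamp adds a version instead of overwriting.
        self.cells.setdefault(column, {})[timestamp] = value

    def latest(self, column):
        versions = self.cells[column]
        return versions[max(versions)]

    def versions(self, column):
        # Newest version first.
        return sorted(self.cells[column].items(), reverse=True)


row = Row()
row.put("info:name", 1, "Natty")           # static data: a single version
row.put("events:purchase", 100, "book")    # event data: one version per event
row.put("events:purchase", 200, "laptop")

print(row.latest("info:name"))
print(row.versions("events:purchase"))
```

Everything known about the entity sits behind one key, so answering "show me everything" is a single lookup rather than a scan plus joins.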
Kiji is for Building Entity-Centric Systems
Often used for:
Building recommendation systems
Personalized search
Real-time HBase applications
Underlying technologies:
Designing an Entity-Centric Datastore
Ask yourself this: what is the entity?
Determine your entity by determining how you want to analyze the data
It’s ok to have data organized in multiple ways
Schema Management with Kiji
Sometimes you actually want a schema layer
Defining a schema allows for data discoverability
Column Families in Kiji
Kiji has two types of column families
Group families are similar to relational tables
Predefined set of columns
Each column has its own data type
Map families specify columns at runtime
Every column has the same data type
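The difference between the two family types can be sketched in a few lines of Python. This is only an illustration of the idea, not Kiji's implementation; the class names `GroupFamily` and `MapFamily`, the use of Python types as stand-ins for column schemas, and the example columns are all assumptions.

```python
class GroupFamily:
    """Columns are declared up front, each with its own type."""

    def __init__(self, schema):
        self.schema = dict(schema)   # qualifier -> expected Python type
        self.values = {}

    def put(self, qualifier, value):
        if qualifier not in self.schema:
            raise KeyError(f"undeclared column: {qualifier}")
        if not isinstance(value, self.schema[qualifier]):
            raise TypeError(f"{qualifier} expects {self.schema[qualifier].__name__}")
        self.values[qualifier] = value


class MapFamily:
    """Qualifiers are chosen at write time; every value shares one type."""

    def __init__(self, value_type):
        self.value_type = value_type
        self.values = {}

    def put(self, qualifier, value):
        if not isinstance(value, self.value_type):
            raise TypeError(f"values must be {self.value_type.__name__}")
        self.values[qualifier] = value


info = GroupFamily({"name": str, "email": str, "purchases": int})
info.put("name", "Natty")

sessions = MapFamily(int)
sessions.put("1234", 3)   # session IDs are not known ahead of time
sessions.put("2345", 7)
```

Writing to an undeclared column such as `info.put("age", 30)` fails in the group family, while the map family accepts any qualifier as long as the value type matches.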
Knowing When To Use Different Family Types
Do you know all of your columns up front? Then use a group family
Map families are for when you don’t know your columns ahead of time
[Figure: example row with columns info:name, info:email, info:purchases, sessions:1234, and sessions:2345]
Choosing a Row Key
Row keys in Kiji are componentized
[ ‘component1’, ‘component2’, 1234 ]
More efficient than byte arrays
Consider ‘1234567890’ versus [ 1234567890 ]
Good for scanning areas of the keyspace
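One plausible reading of the ‘1234567890’ versus [ 1234567890 ] comparison is key size and sort order. The sketch below uses Python's `struct` module to compare a digit string with a fixed-width 64-bit integer; the big-endian encoding is an assumption for illustration and is not claimed to be Kiji's actual on-disk key format.

```python
import struct

as_string = "1234567890".encode("utf-8")    # one byte per digit
as_long = struct.pack(">q", 1234567890)     # big-endian signed 64-bit integer

print(len(as_string), len(as_long))

# Big-endian longs also sort numerically for non-negative values,
# while digit strings sort lexicographically.
numbers = [9, 10, 2]
by_string = sorted(numbers, key=lambda n: str(n).encode("utf-8"))
by_long = sorted(numbers, key=lambda n: struct.pack(">q", n))
print(by_string, by_long)
```

A typed component keeps numeric IDs in numeric order, which matters when scanning a range of the keyspace.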
A Common Use for Components
Known user IDs versus unknown IDs
On a website, how do you differentiate between a logged-in or cookie’d user and a brand new visitor?
[ ‘K’, ‘user1234’ ] or [ ‘U’, ‘unknown2345’ ]
Physically and logically separate rows
Run jobs over all known or unknown users
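Because rows are stored in key order, a leading ‘K’ or ‘U’ component groups each population into one contiguous range. The following sketch models row keys as Python tuples in a sorted list and scans one population; the helper name `scan_prefix` and the sample keys are hypothetical.

```python
from bisect import bisect_left


def scan_prefix(sorted_keys, prefix):
    """Return all keys whose first component equals prefix.

    Relies on the keys being sorted, as rows are in an ordered key-value store.
    """
    start = bisect_left(sorted_keys, (prefix,))
    matches = []
    for key in sorted_keys[start:]:
        if key[0] != prefix:
            break            # left the contiguous range for this prefix
        matches.append(key)
    return matches


keys = sorted([
    ("K", "user1234"),
    ("U", "unknown2345"),
    ("K", "user42"),
    ("U", "unknown7"),
])

print(scan_prefix(keys, "K"))
print(scan_prefix(keys, "U"))
```

A job over all known users only needs to read the ‘K’ range and never touches rows for anonymous visitors.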
Identifying Known Users
Problem: Users have many cookies over time.
Challenge: Ideally, we would have a single row for each user. How do we ensure that new data goes to the right row?
Finding Known Users With Lookup Tables
HBase get operations are fast
It’s easy enough to create a table that contains a mapping of cookies to known user IDs
When data is loaded, check the lookup table to determine if you should write data to an existing row or a new one
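The lookup-table check at load time can be sketched as follows. Plain dictionaries stand in for the lookup table and the user table, and the function names, cookie values, and event strings are hypothetical.

```python
def resolve_row_key(cookie, cookie_to_user):
    """Pick the row a new event should be written to."""
    user_id = cookie_to_user.get(cookie)   # stands in for a fast get on the lookup table
    if user_id is not None:
        return ("K", user_id)              # known user: reuse the existing row
    return ("U", cookie)                   # unknown visitor: keyed by the cookie itself


def load_event(table, cookie_to_user, cookie, event):
    key = resolve_row_key(cookie, cookie_to_user)
    table.setdefault(key, []).append(event)
    return key


cookie_to_user = {"cookie-a": "user1234", "cookie-b": "user1234"}
table = {}

load_event(table, cookie_to_user, "cookie-a", "viewed product")
load_event(table, cookie_to_user, "cookie-b", "added to cart")
load_event(table, cookie_to_user, "cookie-z", "landed on homepage")

print(table)
```

Two different cookies for the same person land in one row, while the unrecognized cookie gets its own ‘U’ row.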
Avoiding Hotspots
Unhashed Row Keys
[Figure: Node 1, Node 2, and Node 3 hosting regions with key ranges A-B, B-C, D-E, F-G, H-I, and J-K]
Hash-Prefixed Row Keys
[Figure: Node 1, Node 2, and Node 3 hosting regions with key ranges 00A-0fK, 10A-1fK, 20A-2fK, 30A-3fK, 40A-4fK, and 50A-5fK]
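A short sketch of why the hash prefix helps: sequential IDs such as user0000, user0001, ... all share a prefix and fall into one key range, whereas a hash prefix scatters them. The choice of SHA-256, the one-byte prefix, the six equal-width regions, and the function names are assumptions for illustration, not a description of Kiji's hashing.

```python
import hashlib
from collections import Counter


def hash_prefixed_key(user_id):
    # Prepend the first byte of a digest (two hex characters) to the ID.
    prefix = hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:2]
    return prefix + user_id


def region_for(key, num_regions=6):
    # Assumes regions split the 256 possible prefix values into equal ranges.
    return int(key[:2], 16) * num_regions // 256


user_ids = [f"user{n:04d}" for n in range(1000)]
spread = Counter(region_for(hash_prefixed_key(u)) for u in user_ids)
print(sorted(spread.items()))
```

The trade-off is that hashed keys no longer sort by user ID, so range scans over consecutive IDs are given up in exchange for an even write load.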
Storing Event Series
360º views need easy access to all the transactions and events for a user
HBase cells may contain more than one version
Kiji leverages this to store event series data like clicks or purchases
[Figure: example row with columns info:name, info:email, info:purchases, sessions:1234, and sessions:2345, with the info:purchases and sessions cells repeated]
How Many Events Are Too Many?
The HBase book warns that too many versions of a cell can cause StoreFile bloat
HBase will never split a row
Common tactic is to add a timestamp range to the row key
Kiji makes this easy with componentized row keys
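Adding a timestamp range to the row key might look like the sketch below, which buckets a user's events by calendar month so that no single row grows without bound. The monthly bucket, the tuple layout, and the helper names are assumptions; the right bucket size depends on event volume.

```python
from datetime import datetime, timezone


def bucketed_row_key(user_id, event_ts_ms):
    # One row per user per calendar month (UTC); the bucket size is a design choice.
    when = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    return ("K", user_id, when.strftime("%Y-%m"))


def to_ms(dt):
    return int(dt.timestamp() * 1000)


june = to_ms(datetime(2013, 6, 13, tzinfo=timezone.utc))
july = to_ms(datetime(2013, 7, 2, tzinfo=timezone.utc))

print(bucketed_row_key("user1234", june))
print(bucketed_row_key("user1234", july))
```

Reading a user's full history then becomes a short scan over that user's buckets instead of a single get.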
Beware of Timestamp Misuse
A major reason the HBase book warns against mucking with timestamps is that they can be dangerous
What happens if you use a sequence number as a timestamp? Think about TTLs
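The TTL question can be worked through with a small sketch. It assumes a simplified expiry rule (a cell expires once its age exceeds the TTL) and a 30-day TTL; the function name and the constants are illustrative, not HBase code.

```python
THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000


def is_expired(cell_ts_ms, now_ms, ttl_ms):
    # A cell is treated as expired once its age exceeds the TTL.
    return now_ms - cell_ts_ms > ttl_ms


now_ms = 1_370_000_000_000               # an arbitrary wall-clock time, epoch milliseconds
one_day_ago = now_ms - 24 * 60 * 60 * 1000

print(is_expired(one_day_ago, now_ms, THIRTY_DAYS_MS))   # real timestamp
print(is_expired(42, now_ms, THIRTY_DAYS_MS))            # sequence number 42 used as a timestamp
```

A sequence number like 42 is read as 42 milliseconds after the epoch, so under any finite TTL the cell looks decades old and is eligible for removal as soon as it is written.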
Iterate and Evolve
Why is Evolution Necessary?
No entity-centric system will be the end-all, be-all the first time around
Data sources in large enterprises are usually heavily siloed
Start small
Incorporate new data sources over time
Putting it Together
Kiji includes a shell that uses DDL to create tables
Many of the features that have been discussed can be declared in the DDL
Users Table
CREATE TABLE 'user_events' WITH DESCRIPTION 'Events table for online users.'
ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id))
PROPERTIES (NUMREGIONS = 32)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events'
),
LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' (
  MAXVERSIONS = 10,
  TTL = FOREVER,
  INMEMORY = true,
  FAMILY recs (
    recommended CLASS com.kiji.avro.ProductRecList
      WITH DESCRIPTION 'Recommended products.'
  )
);
In Summary…
Designing applications in an entity-centric fashion can make them easier to build and more efficient
Kiji can speed up the development process of 360º views
Questions?
Contact me
natty@wibidata.com
@nattyice
The Kiji Project: kiji.org

More Related Content

What's hot

How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the Cloud
VMware Tanzu
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
DataWorks Summit
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
Joydeep Sen Sarma
 

What's hot (20)

Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the Cloud
 
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
 
Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 
Intuit Analytics Cloud 101
Intuit Analytics Cloud 101Intuit Analytics Cloud 101
Intuit Analytics Cloud 101
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0
 
Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
 
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive Approaches
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
Data Modeling a Scheduling App (Adam Hutson, DataScale) | Cassandra Summit 2016
Data Modeling a Scheduling App (Adam Hutson, DataScale) | Cassandra Summit 2016Data Modeling a Scheduling App (Adam Hutson, DataScale) | Cassandra Summit 2016
Data Modeling a Scheduling App (Adam Hutson, DataScale) | Cassandra Summit 2016
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta Lakes
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
 

Viewers also liked

Viewers also liked (20)

HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe
HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, AdobeHBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe
HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe
 
Bulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy DataBulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy Data
 
HBaseCon 2012 | Storing and Manipulating Graphs in HBase
HBaseCon 2012 | Storing and Manipulating Graphs in HBaseHBaseCon 2012 | Storing and Manipulating Graphs in HBase
HBaseCon 2012 | Storing and Manipulating Graphs in HBase
 
Breaking the Sound Barrier with Persistent Memory
Breaking the Sound Barrier with Persistent Memory Breaking the Sound Barrier with Persistent Memory
Breaking the Sound Barrier with Persistent Memory
 
HBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
 
Apache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at XiaomiApache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at Xiaomi
 
Apache HBase at Airbnb
Apache HBase at Airbnb Apache HBase at Airbnb
Apache HBase at Airbnb
 
Content Identification using HBase
Content Identification using HBaseContent Identification using HBase
Content Identification using HBase
 
New Security Features in Apache HBase 0.98: An Operator's Guide
New Security Features in Apache HBase 0.98: An Operator's GuideNew Security Features in Apache HBase 0.98: An Operator's Guide
New Security Features in Apache HBase 0.98: An Operator's Guide
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the Basics
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit
HBaseCon 2012 | HBase powered Merchant Lookup Service at IntuitHBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit
HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
Search2012 ibm vf
Search2012 ibm vfSearch2012 ibm vf
Search2012 ibm vf
 
Streaming map reduce
Streaming map reduceStreaming map reduce
Streaming map reduce
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
 
Apache HBase 0.98
Apache HBase 0.98Apache HBase 0.98
Apache HBase 0.98
 

Similar to Design Patterns for Building 360-degree Views with HBase and Kiji

Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
n5712036
 

Similar to Design Patterns for Building 360-degree Views with HBase and Kiji (20)

Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel   Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel
 
How we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the wayHow we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the way
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger Brains
 
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
Tools and Tips: From Accidental to Efficient Data Warehouse Developer (SQLSat...
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
 
Agile Testing Days 2017 Introducing AgileBI Sustainably
Agile Testing Days 2017 Introducing AgileBI SustainablyAgile Testing Days 2017 Introducing AgileBI Sustainably
Agile Testing Days 2017 Introducing AgileBI Sustainably
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
 
Async
AsyncAsync
Async
 
Keynote - Speaker: Grigori Melnik
Keynote - Speaker: Grigori Melnik Keynote - Speaker: Grigori Melnik
Keynote - Speaker: Grigori Melnik
 
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, BarcelonaReal-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
Freeing Yourself from an RDBMS Architecture
Freeing Yourself from an RDBMS ArchitectureFreeing Yourself from an RDBMS Architecture
Freeing Yourself from an RDBMS Architecture
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
 
Xomia_20220602.pptx
Xomia_20220602.pptxXomia_20220602.pptx
Xomia_20220602.pptx
 
Telecom datascience master_public
Telecom datascience master_publicTelecom datascience master_public
Telecom datascience master_public
 
Retail referencearchitecture productcatalog
Retail referencearchitecture productcatalogRetail referencearchitecture productcatalog
Retail referencearchitecture productcatalog
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Indic threads pune12-nosql now and path ahead
Indic threads pune12-nosql now and path aheadIndic threads pune12-nosql now and path ahead
Indic threads pune12-nosql now and path ahead
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 

More from HBaseCon

More from HBaseCon (20)

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
 
hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beam
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Netease
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.com
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBase
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
 

Recently uploaded

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Recently uploaded (20)

10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 

Design Patterns for Building 360-degree Views with HBase and Kiji

  • 1. Design Patterns for 360º Views using HBase and Kiji Jonathan Natkins
  • 2. Who am I? Jon “Natty” Natkins Field Engineer at WibiData Formerly at Cloudera/Vertica
  • 3. What is a 360º View?
  • 4. What is a 360º View For? Past What interactions has a customer had in the past? Present What is the customer doing right now? Future What is the customer likely do to next? Past and present inform the future
  • 5. What If I Don’t Care About Customers?
  • 6. Generalizing the 360º View: Entity-Centric Systems
  • 7. Goal of an Entity-Centric System “Show me everything I know about Natty”
  • 8. What Data Do I Need to Store? Static data Event-oriented data Derived data
  • 9. Building Entity-Centric Systems Often, this is an EDW with a star schema Fact Dim Dim Dim Dim
  • 10. Challenges With Star Schemas How do we answer the original question? Full table scan + joins OLTP systems will likely fall over from the volume OLAP systems are usually not optimized for single-row lookups
  • 12.
  • 13. Why HBase rows can store both static and event-oriented data Cell versions are key Single-row lookups are extremely fast
  • 14. is for Building Entity-Centric Systems Often used for: Building recommendation systems Personalized search Real-time HBase applications Underlying technologies:
  • 15. Designing an Entity-Centric Datastore Ask yourself this: what is the entity? Determine your entity by determining how you want to analyze the data It’s ok to have data organized in multiple ways
  • 16. Schema Management with Kiji Sometimes you actually want a schema layer Defining a schema allows for data discoverability
  • 17. Column Families in Kiji Kiji has two types of column families Group families are similar to relational tables Predefined set of columns Each column has its own data type Map families specify columns at runtime Every column has the same data type
  • 18. sessions:23 45 sessions:23 45 sessions:23 45 sessions:12 34 sessions:12 34 info:purchase s Knowing When To Use Different Family Types Do you know all of your columns up front? Then use a group family Map families are for when you don’t know your columns ahead of time info:name info:email sessions:12 34 sessions:23 45 info:purchase s info:purchase s
  • 19. Choosing a Row Key Row keys in Kiji are componentized [ ‘component1’, ‘component2’, 1234 ] More efficient than byte arrays Consider ‘1234567890’ versus [ 1234567890 ] Good for scanning areas of the keyspace
  • 20. A Common Use for Components Known users IDs versus unknown IDs On a website, how do you differentiate between a logged-in or cookie’d user versus a brand new visitor [ ‘K’, ‘user1234’ ] or [ ‘U’, ‘unknown2345’ ] Physically and logically separate rows Run jobs over all known or unknown users
  • 21. Identifying Known Users Problem: Users have many cookies over time. Challenge: Ideally, we would have a single row for each user. How do we ensure that new data goes to the right row?
  • 22. Finding Known Users With Lookup Tables: HBase get operations are fast. It’s easy enough to create a table that contains a mapping of cookies to known user IDs. When data is loaded, check the lookup table to determine if you should write data to an existing row or a new one.
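The lookup-table check described on this slide can be sketched roughly as below. Plain dictionaries stand in for the two HBase tables, and the names `cookie_to_user`, `rows`, `row_key_for`, and `record_event` are hypothetical; a real loader would issue HBase gets and puts instead.

```python
# In-memory stand-ins for the lookup table and the entity table.
# All names here are illustrative assumptions, not part of Kiji or HBase.
cookie_to_user = {"cookie-abc": "user1234"}  # cookie -> known user ID
rows = {}                                    # row key -> list of events

def row_key_for(cookie_id: str) -> tuple:
    """Pick the row for an incoming event using the cookie lookup table."""
    user_id = cookie_to_user.get(cookie_id)  # stands in for a fast HBase get
    if user_id is not None:
        return ("K", user_id)    # known user: write to the existing row
    return ("U", cookie_id)      # unknown visitor: write to a separate row

def record_event(cookie_id: str, event: str) -> None:
    """Append the event to whichever row the lookup selected."""
    rows.setdefault(row_key_for(cookie_id), []).append(event)

record_event("cookie-abc", "click:home")
record_event("cookie-xyz", "click:pricing")
print(sorted(rows))  # [('K', 'user1234'), ('U', 'cookie-xyz')]
```

The ‘K’ and ‘U’ first components follow the known/unknown convention from the earlier slide, so the two populations stay in separate parts of the keyspace.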
  • 24. Unhashed Row Keys [Diagram: Node 1, Node 2, Node 3 hosting regions A-B, B-C, D-E, F-G, H-I, J-K]
  • 25. Hash-Prefixed Row Keys [Diagram: Node 1, Node 2, Node 3 hosting regions 00A-0fK, 10A-1fK, 20A-2fK, 30A-3fK, 40A-4fK, 50A-5fK]
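The difference between the two diagrams can be sketched in a few lines. This is a minimal illustration that assumes a one-byte MD5 prefix; the choice of MD5, the prefix length, and the function name `hash_prefixed_key` are assumptions made for the example, not a description of how Kiji computes its hashes.

```python
import hashlib

def hash_prefixed_key(user_id: str, prefix_len: int = 1) -> bytes:
    """Prepend the first prefix_len bytes of an MD5 digest to the raw ID."""
    raw = user_id.encode("utf-8")
    return hashlib.md5(raw).digest()[:prefix_len] + raw

# 1000 sequential IDs: user0000 .. user0999
user_ids = [f"user{n:04d}" for n in range(1000)]

# Without hashing, every key starts with the same byte, so sequential
# writes land in the same part of the keyspace.
unhashed_first_bytes = {uid.encode("utf-8")[:1] for uid in user_ids}

# With a hash prefix, the leading byte varies from key to key, so the
# same IDs are spread across many key ranges (and therefore regions).
hashed_first_bytes = {hash_prefixed_key(uid)[:1] for uid in user_ids}

print(len(unhashed_first_bytes), len(hashed_first_bytes))
```

The trade-off is that a hash prefix scatters adjacent IDs, so range scans over consecutive user IDs are no longer contiguous.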
  • 26. Storing Event Series: 360º views need easy access to all the transactions and events for a user. HBase cells may contain more than one version. Kiji leverages this to store event series data like clicks or purchases. [Figure: row cells labeled info:name, info:email, info:purchases, sessions:1234, sessions:2345]
  • 27. How Many Events is Too Many? The HBase book warns that too many versions of a cell can cause StoreFile bloat. HBase will never split a row. A common tactic is to add a timestamp range to the row key. Kiji makes this easy with componentized row keys.
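Adding a timestamp range to the row key might look like the sketch below, which buckets a user's events by calendar month. The monthly granularity, the `(user_id, bucket)` key layout, and the function name `bucketed_row_key` are assumptions chosen for illustration.

```python
from datetime import datetime, timezone

def bucketed_row_key(user_id: str, event_time_ms: int) -> tuple:
    """Build a two-component key: the user ID plus a UTC year-month bucket."""
    event_time = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return (user_id, event_time.strftime("%Y-%m"))

# Events from different months land in different rows for the same user,
# which bounds how many cell versions any single row accumulates.
print(bucketed_row_key("user1234", 0))              # ('user1234', '1970-01')
print(bucketed_row_key("user1234", 1700000000000))  # ('user1234', '2023-11')
```

A coarser or finer bucket (year, week, day) trades row size against the number of rows that must be read to assemble a full history.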
  • 28. Beware of Timestamp Misuse: A major reason the HBase book warns against mucking with timestamps is that doing so can be dangerous. What happens if you use a sequence number as a timestamp? Think about TTLs.
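One way to think through the sequence-number question is the simplified model below. It assumes TTL expiry works by comparing a cell's timestamp, read as milliseconds since the epoch, with the current time; `is_expired` and the 30-day TTL are hypothetical stand-ins, not HBase code.

```python
import time

THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000  # example TTL, chosen arbitrarily

def is_expired(cell_timestamp_ms: int, now_ms: int, ttl_ms: int) -> bool:
    """Simplified TTL check: a cell expires once it is older than the TTL."""
    return now_ms - cell_timestamp_ms > ttl_ms

now_ms = int(time.time() * 1000)

wall_clock_cell = now_ms - 1000  # written one second ago with a real timestamp
sequence_cell = 42               # "timestamp" is really sequence number 42

print(is_expired(wall_clock_cell, now_ms, THIRTY_DAYS_MS))  # False
print(is_expired(sequence_cell, now_ms, THIRTY_DAYS_MS))    # True
```

Under this model a sequence number looks like a moment just after January 1, 1970, so any finite TTL treats the cell as long expired.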
  • 30. Why is Evolution Necessary? No entity-centric system will be the end-all, be-all the first time around. Data sources in large enterprises are usually heavily silo’d. Start small. Incorporate new data sources over time.
  • 31. Putting it Together: Kiji includes a shell that uses DDL to create tables. Many of the features discussed so far can be declared in the DDL.
  • 32. Users Table
    CREATE TABLE 'user_events' WITH DESCRIPTION 'Events table for online users.'
    ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id))
    PROPERTIES (NUMREGIONS = 32)
    WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
      MAXVERSIONS = INFINITY,
      TTL = FOREVER,
      INMEMORY = false,
      MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events'
    ),
    LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' (
      MAXVERSIONS = 10,
      TTL = FOREVER,
      INMEMORY = true,
      FAMILY recs (
        recommended CLASS com.kiji.avro.ProductRecList WITH DESCRIPTION 'Recommended products.'
      )
    );
  • 36. In Summary… Designing applications in an entity-centric fashion can make them easier to build and more efficient. Kiji can speed up the development process of 360º views.