Design Patterns for 360º Views
using HBase and Kiji
Jonathan Natkins
Who am I?
Jon “Natty” Natkins
Field Engineer at WibiData
Formerly at Cloudera/Vertica
What is a 360º View?
What is a 360º View For?
Past
What interactions has a customer had in the past?
Present
What is the customer doing right now?
Future
What is the customer likely to do next?
Past and present inform the future
What If I Don’t Care About
Customers?
Generalizing the 360º View:
Entity-Centric Systems
Goal of an Entity-Centric
System
“Show me everything I know
about Natty”
What Data Do I Need to Store?
Static data
Event-oriented data
Derived data
Building Entity-Centric Systems
Often, this is an EDW with a star schema
[Diagram: star schema with a central fact table surrounded by four dimension tables]
Challenges With Star Schemas
How do we answer the original question?
Full table scan + joins
OLTP systems will likely fall over from the
volume
OLAP systems are usually not optimized for
single-row lookups
Need Something
Else…
Why HBase?
HBase rows can store both static and
event-oriented data
Cell versions are key
Single-row lookups are extremely fast
Kiji is for Building
Entity-Centric Systems
Often used for:
Building recommendation systems
Personalized search
Real-time HBase applications
Underlying technologies:
Designing an Entity-Centric
Datastore
Ask yourself this: what is the entity?
Determine your entity by determining how
you want to analyze the data
It’s ok to have data organized in multiple
ways
Schema Management with Kiji
Sometimes you actually want a schema layer
Defining a schema allows for data discoverability
Column Families in Kiji
Kiji has two types of column families
Group families are similar to relational
tables
Predefined set of columns
Each column has its own data type
Map families specify columns at runtime
Every column has the same data type
Knowing When To Use Different
Family Types
Do you know all of your columns up front?
Then use a group family
Map families are for when you don’t know
your columns ahead of time
[Diagram: a row with columns info:name, info:email, info:purchases, sessions:1234, and sessions:2345]
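The difference between the two family types can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Kiji's API: the class names GroupFamily and MapFamily, the put method, and the sample columns are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GroupFamily:
    """Hypothetical model: a fixed set of columns, each with its own type."""
    columns: dict                      # qualifier -> expected Python type
    cells: dict = field(default_factory=dict)

    def put(self, qualifier, value):
        if qualifier not in self.columns:
            raise KeyError(f"unknown column: {qualifier}")
        if not isinstance(value, self.columns[qualifier]):
            raise TypeError(f"{qualifier} expects {self.columns[qualifier].__name__}")
        self.cells[qualifier] = value


@dataclass
class MapFamily:
    """Hypothetical model: qualifiers chosen at write time, one shared type."""
    value_type: type
    cells: dict = field(default_factory=dict)

    def put(self, qualifier, value):
        if not isinstance(value, self.value_type):
            raise TypeError(f"values must be {self.value_type.__name__}")
        self.cells[qualifier] = value


# Columns known up front: a group family.
info = GroupFamily(columns={"name": str, "email": str})
info.put("name", "Natty")

# Columns discovered at runtime (for example, session IDs): a map family.
sessions = MapFamily(value_type=int)
sessions.put("1234", 3)
sessions.put("2345", 7)
```

The group family rejects a column that was never declared, while the map family accepts any qualifier as long as the value has the shared type.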
Choosing a Row Key
Row keys in Kiji are componentized
[ ‘component1’, ‘component2’, 1234 ]
More efficient than byte arrays
Consider ‘1234567890’ versus [ 1234567890 ]
Good for scanning areas of the keyspace
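A small sketch of why a typed numeric component can beat a string: it compares the ID from the slide stored as decimal text against the same ID stored as a fixed-width big-endian integer. The 64-bit encoding is an assumption chosen for illustration; Kiji's actual on-disk encoding may differ.

```python
import struct

# The same ID as decimal text versus a fixed-width 64-bit integer.
as_string = "1234567890".encode("utf-8")
as_long = struct.pack(">q", 1234567890)   # big-endian signed 64-bit

print(len(as_string), len(as_long))       # 10 bytes versus 8 bytes

# Byte-wise ordering also differs: text sorts "10" before "9",
# while fixed-width big-endian bytes keep non-negative numbers in numeric order.
ids = [9, 10, 100]
by_string = sorted(ids, key=lambda i: str(i).encode("utf-8"))
by_long = sorted(ids, key=lambda i: struct.pack(">q", i))
```

Because HBase orders rows by comparing key bytes, the second property is what makes range scans over numeric components behave as expected.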
A Common Use for
Components
Known user IDs versus unknown IDs
On a website, how do you differentiate
between a logged-in or cookie’d user versus a
brand new visitor?
[ ‘K’, ‘user1234’ ] or [ ‘U’, ‘unknown2345’ ]
Physically and logically separate rows
Run jobs over all known or unknown users
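To illustrate how a leading component separates the keyspace, here is a minimal sketch that keeps row keys as sorted tuples and scans only one prefix. It assumes keys are ordered by their components with no hash prefix in front; the helper scan_prefix and the sample keys are hypothetical.

```python
from bisect import bisect_left

# Row keys as component tuples, kept in sorted order like an HBase keyspace.
rows = sorted([
    ("K", "user1234"),
    ("U", "unknown2345"),
    ("K", "user0042"),
    ("U", "unknown0001"),
])


def scan_prefix(sorted_keys, first_component):
    """Return the contiguous run of keys whose first component matches."""
    start = bisect_left(sorted_keys, (first_component,))
    matched = []
    for key in sorted_keys[start:]:
        if key[0] != first_component:
            break                      # left the prefix range, stop scanning
        matched.append(key)
    return matched


known_users = scan_prefix(rows, "K")
unknown_users = scan_prefix(rows, "U")
```

A job over all known users only needs to touch the "K" range and never reads an unknown-visitor row.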
Identifying Known Users
Problem: Users have many cookies over
time.
Challenge: Ideally, we would have a single
row for each user. How do we ensure that
new data goes to the right row?
Finding Known Users With
Lookup Tables
HBase get operations are fast
It’s easy enough to create a table that
contains a mapping of cookies to known
user IDs
When data is loaded, check the lookup
table to determine if you should write data
to an existing row or a new one
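The load-time check might look roughly like the sketch below. A plain dictionary stands in for the lookup table, and the names cookie_to_user, row_key_for, and link_cookie are hypothetical; in a real deployment each lookup would be a get against an HBase table.

```python
# Stand-in for the cookie -> known user ID lookup table.
cookie_to_user = {"cookie-abc": "user1234"}


def row_key_for(cookie_id):
    """Pick the row that an incoming event should be written to."""
    user_id = cookie_to_user.get(cookie_id)
    if user_id is not None:
        return ("K", user_id)          # existing row for a known user
    return ("U", cookie_id)            # new row for an unknown visitor


def link_cookie(cookie_id, user_id):
    """Record that a cookie belongs to a known user (e.g. after login)."""
    cookie_to_user[cookie_id] = user_id


before = row_key_for("cookie-xyz")     # not yet linked to a user
link_cookie("cookie-xyz", "user1234")
after = row_key_for("cookie-xyz")      # now routed to the user's row
```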
Avoiding Hotspots
Unhashed Row Keys
[Diagram: Nodes 1-3 hosting regions A-B, B-C, D-E, F-G, H-I, and J-K]
Hash-Prefixed Row Keys
[Diagram: Nodes 1-3 hosting regions 00A-0fK, 10A-1fK, 20A-2fK, 30A-3fK, 40A-4fK, and 50A-5fK]
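The effect of a hash prefix can be sketched as follows. MD5 is used here only because it is a convenient standard-library hash, and the function name hash_prefixed_key and the two-character prefix length are assumptions for illustration rather than Kiji's exact scheme.

```python
import hashlib


def hash_prefixed_key(user_id, prefix_hex_chars=2):
    """Prepend the first hex characters of a hash of the ID to the ID itself."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return digest[:prefix_hex_chars] + user_id


# Sequential IDs all share the same leading characters...
user_ids = [f"user{i:04d}" for i in range(1000)]
plain_prefixes = {uid[:2] for uid in user_ids}

# ...while hash-prefixed keys scatter across the keyspace.
hashed_prefixes = {hash_prefixed_key(uid)[:2] for uid in user_ids}

print(len(plain_prefixes), len(hashed_prefixes))
```

Unhashed, every one of these keys would land in the same region; with the prefix, writes spread over many key ranges. The trade-off is that a scan over a contiguous range of user IDs is no longer contiguous on disk.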
Storing Event Series
360º views need easy access to all the
transactions and events for a user
HBase cells may contain more than one
version
Kiji leverages this to store event series
data like clicks or purchases
[Diagram: a single row with columns info:name, info:email, info:purchases, sessions:1234, and sessions:2345, several of them holding multiple cell versions]
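As a rough mental model of event series stored as cell versions, the sketch below keeps every column as a map from timestamp to value and reads versions newest first. The Row class and its methods are hypothetical and only mimic the versioned-cell idea in memory; they are not the HBase or Kiji client API.

```python
from collections import defaultdict


class Row:
    """Hypothetical in-memory row: column -> {timestamp: value}."""

    def __init__(self, max_versions=None):
        self._cells = defaultdict(dict)
        self.max_versions = max_versions

    def put(self, column, timestamp, value):
        versions = self._cells[column]
        versions[timestamp] = value
        # Drop the oldest version once the cap is exceeded.
        if self.max_versions is not None and len(versions) > self.max_versions:
            del versions[min(versions)]

    def get(self, column, limit=None):
        """Return (timestamp, value) pairs, newest first."""
        items = sorted(self._cells[column].items(), reverse=True)
        return items if limit is None else items[:limit]


row = Row()
row.put("info:name", 1000, "Natty")            # static data: one version
row.put("info:purchases", 1000, "book")        # event data: many versions
row.put("info:purchases", 2000, "lamp")

latest = row.get("info:purchases", limit=1)
history = row.get("info:purchases")
```

Static attributes and the full event history live in the same row, so a single-row read returns both.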
How Many Events is Too Many?
The HBase book warns that too many
versions of a cell can cause StoreFile
bloat
HBase will never split a row
Common tactic is to add a timestamp
range to the row key
Kiji makes this easy with componentized row keys
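One way to picture the timestamp-range tactic is a row key with a time bucket as its last component, so that each user gets one row per period instead of one ever-growing row. The monthly bucket, the "K" prefix, and the function name bucketed_row_key are illustrative assumptions.

```python
from datetime import datetime, timezone


def bucketed_row_key(user_id, event_ts_ms):
    """Build a key like ('K', 'user1234', 201306) from an event timestamp."""
    dt = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    month_bucket = dt.year * 100 + dt.month
    return ("K", user_id, month_bucket)


june = int(datetime(2013, 6, 15, tzinfo=timezone.utc).timestamp() * 1000)
july = int(datetime(2013, 7, 2, tzinfo=timezone.utc).timestamp() * 1000)

june_key = bucketed_row_key("user1234", june)
july_key = bucketed_row_key("user1234", july)
```

Events from different months land in different rows, which bounds the number of versions per cell; reading a user's full history then becomes a short scan over that user's buckets rather than a single get.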
Beware of Timestamp Misuse
A major reason the HBase book warns
against mucking with timestamps is that
they can be dangerous
What happens if you use a sequence number
as a timestamp? Think about TTLs
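The TTL question can be answered with a little arithmetic. The sketch below uses a simplified expiry rule (a cell expires once its age exceeds the TTL); the function is_expired and the fixed "current time" are assumptions made for the example.

```python
def is_expired(cell_timestamp_ms, now_ms, ttl_ms):
    """Simplified TTL rule: a cell expires when its age exceeds the TTL."""
    return now_ms - cell_timestamp_ms > ttl_ms


NOW_MS = 1_371_254_400_000                  # a fixed "now" in mid-2013
THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000

real_event_ts = NOW_MS - 60_000             # written one minute ago
sequence_number = 42                        # misused as a timestamp

real_expired = is_expired(real_event_ts, NOW_MS, THIRTY_DAYS_MS)
sequence_expired = is_expired(sequence_number, NOW_MS, THIRTY_DAYS_MS)
```

A sequence number interpreted as milliseconds since the epoch looks like a write from January 1970, so under any finite TTL the cell is already expired the moment it is written.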
Iterate and Evolve
Why is Evolution Necessary?
No entity-centric system will be the end-all,
be-all the first time around
Data sources in large enterprises are
usually heavily silo’d
Start small
Incorporate new data sources over time
Putting it Together
Kiji includes a shell to use DDL to create
tables
Many of the features that have been
discussed are declarative via the DDL
Users Table
CREATE TABLE 'user_events' WITH DESCRIPTION 'Events table for online users.'
ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id))
PROPERTIES (NUMREGIONS = 32)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events'
),
LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' (
  MAXVERSIONS = 10,
  TTL = FOREVER,
  INMEMORY = true,
  FAMILY recs (
    recommended CLASS com.kiji.avro.ProductRecList
      WITH DESCRIPTION 'Recommended products.'
  )
);
In Summary…
Designing applications in an entity-centric
fashion can make them easier to build and
more efficient
Kiji can speed up the development
process of 360º views
Questions?
Contact me
natty@wibidata.com
@nattyice
The Kiji Project: kiji.org
