Design Patterns for Building 360-degree Views with HBase and Kiji

2,521 views

Published on

Speaker: Jonathan Natkins (WibiData)

Many companies aspire to have 360-degree views of their data. Whether they're concerned about customers, users, accounts, or more abstract things like sensors, organizations are focused on developing capabilities for analyzing all the data they have about these entities. This talk will introduce the concept of entity-centric storage, discuss what it means, what it enables for businesses, and how to develop an entity-centric system using the open-source Kiji framework and HBase. It will also compare and contrast traditional methods of building a 360-degree view on a relational database versus building against a distributed key-value store, and why HBase is a good choice for implementing an entity-centric system.

Published in: Software, Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,521
On SlideShare
0
From Embeds
0
Number of Embeds
97
Actions
Shares
0
Downloads
161
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Design Patterns for Building 360-degree Views with HBase and Kiji

  1. 1. Design Patterns for 360º Views using HBase and Kiji Jonathan Natkins
  2. 2. Who am I? Jon “Natty” Natkins Field Engineer at WibiData Formerly at Cloudera/Vertica
  3. 3. What is a 360º View?
  4. 4. What is a 360º View For? Past What interactions has a customer had in the past? Present What is the customer doing right now? Future What is the customer likely do to next? Past and present inform the future
  5. 5. What If I Don’t Care About Customers?
  6. 6. Generalizing the 360º View: Entity-Centric Systems
  7. 7. Goal of an Entity-Centric System “Show me everything I know about Natty”
  8. 8. What Data Do I Need to Store? Static data Event-oriented data Derived data
  9. 9. Building Entity-Centric Systems Often, this is an EDW with a star schema Fact Dim Dim Dim Dim
  10. 10. Challenges With Star Schemas How do we answer the original question? Full table scan + joins OLTP systems will likely fall over from the volume OLAP systems are usually not optimized for single-row lookups
  11. 11. Need Something Else…
  12. 12. Why HBase rows can store both static and event-oriented data Cell versions are key Single-row lookups are extremely fast
  13. 13. is for Building Entity-Centric Systems Often used for: Building recommendation systems Personalized search Real-time HBase applications Underlying technologies:
  14. 14. Designing an Entity-Centric Datastore Ask yourself this: what is the entity? Determine your entity by determining how you want to analyze the data It’s ok to have data organized in multiple ways
  15. 15. Schema Management with Kiji Sometimes you actually want a schema layer Defining a schema allows for data discoverability
  16. 16. Column Families in Kiji Kiji has two types of column families Group families are similar to relational tables Predefined set of columns Each column has its own data type Map families specify columns at runtime Every column has the same data type
  17. 17. sessions:23 45 sessions:23 45 sessions:23 45 sessions:12 34 sessions:12 34 info:purchase s Knowing When To Use Different Family Types Do you know all of your columns up front? Then use a group family Map families are for when you don’t know your columns ahead of time info:name info:email sessions:12 34 sessions:23 45 info:purchase s info:purchase s
  18. 18. Choosing a Row Key Row keys in Kiji are componentized [ ‘component1’, ‘component2’, 1234 ] More efficient than byte arrays Consider ‘1234567890’ versus [ 1234567890 ] Good for scanning areas of the keyspace
  19. 19. A Common Use for Components Known users IDs versus unknown IDs On a website, how do you differentiate between a logged-in or cookie’d user versus a brand new visitor [ ‘K’, ‘user1234’ ] or [ ‘U’, ‘unknown2345’ ] Physically and logically separate rows Run jobs over all known or unknown users
  20. 20. Identifying Known Users Problem: Users have many cookies over time. Challenge: Ideally, we would have a single row for each user. How do we ensure that new data goes to the right row?
  21. 21. Finding Known Users With Lookup Tables HBase get operations are fast It’s easy enough to create a table that contains a mapping of cookies to known user IDs When data is loaded, check the lookup table to determine if you should write data to an existing row or a new one
  22. 22. Avoiding Hotspots
  23. 23. Unhashed Row Keys Node 1 Node 2 Node 3 Region A-B Region B-C Region D-E Region F-G Region H-I Region J-K
  24. 24. Hash-Prefixed Row Keys Node 1 Node 2 Node 3 Region 00A-0fK Region 10A-1fK Region 20A-2fK Region 30A-3fK Region 40A-4fK Region 50A-5fK
  25. 25. Storing Event Series 360º views need easy access to all the transactions and events for a user HBase cells may contain more than one version Kiji leverages this to store event series data like clicks or purchases sessions:23 45 sessions:23 45 sessions:23 45 sessions:12 34 sessions:12 34 info:purchase sinfo:name info:email sessions:12 34 sessions:23 45 info:purchase s info:purchase s
  26. 26. How Many Events is Too Many? The HBase book warns that too many versions of a cell can cause StoreFile bloat HBase will never split a row Common tactic is to add a timestamp range to the row key Kiji makes this easy with componentized row
  27. 27. Beware of Timestamp Misuse A major reason the HBase book warns against mucking with timestamps is that they can be dangerous What happens if you use a sequence number as a timestamp? Think about TTLs
  28. 28. Iterate and Evolve
  29. 29. Why is Evolution Necessary? No entity-centric system will be the end-all, be-all the first time around Data sources in large enterprises are usually heavily silo’d Start small Incorporate new data sources over time
  30. 30. Putting it Together Kiji includes a shell to use DDL to create tables Many of the features that have been discussed are declarative via the DDL
  31. 31. Users Table CREATE TABLE ’user_events' WITH DESCRIPTION 'Events table for online users.' ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id)) PROPERTIES (NUMREGIONS = 32) WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' ( MAXVERSIONS = INFINITY, TTL = FOREVER, INMEMORY = false, MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events' ), LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' ( MAXVERSIONS = 10, TTL = FOREVER, INMEMORY = true, FAMILY recs ( recommended CLASS com.kiji.avro.ProductRecList WITH DESCRIPTION 'Recommended products.' ) );
  32. 32. Users Table CREATE TABLE ’user_events' WITH DESCRIPTION 'Events table for online users.' ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id)) PROPERTIES (NUMREGIONS = 32) WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' ( MAXVERSIONS = INFINITY, TTL = FOREVER, INMEMORY = false, MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events' ), LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' ( MAXVERSIONS = 10, TTL = FOREVER, INMEMORY = true, FAMILY recs ( recommended CLASS com.kiji.avro.ProductRecList WITH DESCRIPTION 'Recommended products.' ) );
  33. 33. Users TableCREATE TABLE ’user_events' WITH DESCRIPTION 'Events table for online users.' ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id)) PROPERTIES (NUMREGIONS = 32) WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' ( MAXVERSIONS = INFINITY, TTL = FOREVER, INMEMORY = false, MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events' ), LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' ( MAXVERSIONS = 10, TTL = FOREVER, INMEMORY = true, FAMILY recs ( recommended CLASS com.kiji.avro.ProductRecList WITH DESCRIPTION 'Recommended products.' ) );
  34. 34. Users Table CREATE TABLE ’user_events' WITH DESCRIPTION 'Events table for online users.' ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id)) PROPERTIES (NUMREGIONS = 32) WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' ( MAXVERSIONS = INFINITY, TTL = FOREVER, INMEMORY = false, MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events' ), LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' ( MAXVERSIONS = 10, TTL = FOREVER, INMEMORY = true, FAMILY recs ( recommended CLASS com.kiji.avro.ProductRecList WITH DESCRIPTION 'Recommended products.’ ) );
  35. 35. In Summary… Designing applications in an entity-centric fashion can make them easier to build and more efficient Kiji can speed up the development process of 360º views
  36. 36. Questions? Contact me natty@wibidata.com @nattyice The Kiji Project: kiji.org

×