Building a Flexible, Real-time
Big Data Applications Platform
on Cassandra with Kiji
Cassandra Day Silicon Valley
07 April...

Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications, including the following:
• Support for evolvable schemas of complex data types

• Batch training of machine learning models with Hadoop

• Real-time scoring with trained models

• Integration with Hive and R

• A REST endpoint

Recently, we have updated Kiji to use Cassandra as a backing data store (previously, Kiji worked only with HBase). In this talk, we describe the process of integrating Cassandra and Kiji. Topics we cover include the following:

• The Kiji architecture and data model

• Implementing the Kiji data model in Cassandra using the Java driver and CQL3

• Integrating Cassandra with Hadoop 2.x

• Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)

• Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users


Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

  1. Building a Flexible, Real-time Big Data Applications Platform on Cassandra with Kiji. Cassandra Day Silicon Valley, 07 April 2014. Clint Kelly, Member of Technical Staff, WibiData
  2. Have this... Want to build this... Kiji
  3. History of the Kiji Project • Created at WibiData • Originally built on top of HBase • Now works with Cassandra. One data model for two databases ➔ challenges!
  4. Overview • The Kiji Project • Kiji data model • Kiji on Cassandra
  5. The Kiji Project
  6. Components of Kiji: Batch, Data storage, Real-time
  7. Kiji real-time components • Score models • Manage models • REST interface [Component stack diagram: Hadoop, C*, HBase, Avro; KijiSchema; KijiMR, KijiREST, KijiHive, KijiScoring, KijiExpress]
  8. Kiji batch components • Expressive DSL • Machine learning library • Hive [Component stack diagram repeated]
  9. KijiSchema • Serialization • Complex data types • Schema management [Component stack diagram repeated]
      record UserLog {
        long timestamp;
        int user_id;
        string url;
        long session_id;
        array<string> terms;
      }
  10. KijiSchema • Initially HBase-only • Now HBase and Cassandra [Component stack diagram repeated]
  11. In production now • Fortune 500 retailer: Personalized recommendations • OPower: Energy usage and analytics reporting
  12. Kiji data model
  13. table
  14. [Diagram: a table made up of rows]
  15. row
  16. Row key = entity ID. [Diagram: entity ID | data]
  17. Composite entity IDs. [Diagram: 0xfa, “bob” | data]
  18. Data organized into column families. [Diagram: 0xfa, “bob” | info | songs]
  19. Column families contain columns. Column name = qualifier. [Diagram: 0xfa, “bob” | info:email | info:payment | songs:let it be | songs:help | songs:helter skelter]
  20. Columns can have timestamped versions. [Diagram: the same row, with several stacked versions of songs:let it be, one labeled 1396560123]
  21. Column values can be complex data types. [Diagram: the same row]
      record SongPlay {
        long song_id;
        int user_rating;
        long session_id;
        device_type device;
      }
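
The data model on slides 16 to 21 can be pictured as a nested map. The following is only a minimal sketch in Python; `KijiRow` and its methods are hypothetical names invented for illustration and are not the Kiji API. It assumes each cell is addressed by (family, qualifier, timestamp) and that reads return the newest version first.

```python
from dataclasses import dataclass, field


@dataclass
class KijiRow:
    """Illustrative stand-in for one Kiji row; not the real Kiji API."""
    entity_id: tuple
    # (family, qualifier) -> {timestamp: value}
    cells: dict = field(default_factory=dict)

    def put(self, family, qualifier, timestamp, value):
        self.cells.setdefault((family, qualifier), {})[timestamp] = value

    def versions(self, family, qualifier):
        """Return (timestamp, value) pairs for one column, newest first."""
        column = self.cells.get((family, qualifier), {})
        return sorted(column.items(), key=lambda item: item[0], reverse=True)

    def latest(self, family, qualifier):
        """Return the newest value of a column, or None if it is empty."""
        versions = self.versions(family, qualifier)
        return versions[0][1] if versions else None


# Composite entity ID, two column families, one column with two versions.
row = KijiRow(entity_id=(0xFA, "bob"))
row.put("info", "email", 1396560000, "bob@gmail.com")
row.put("songs", "let it be", 1396560100, {"song_id": 1, "user_rating": 4})
row.put("songs", "let it be", 1396560123, {"song_id": 1, "user_rating": 5})
```

The dictionaries stored as values stand in for Avro records such as `SongPlay`; the field values are made up.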
  22. Locality groups: Arrange data based on query pattern. [Diagram: entity ID | info | songs]
  23. Locality groups: Arrange data based on query pattern. [Diagram: entity ID | info | songs_today | songs_prev_year]
      info: Need only one version of each column.
      songs_today: Need ASAP for real-time scoring; expires quickly.
      songs_prev_year: Used for training ML algorithms in batch; keep forever.
  24. Locality groups: Arrange data based on query pattern. [Diagram: entity ID | info | songs_today | songs_prev_year]
      info: Need only one version of each column. MAX_VERSIONS=1 TTL=FOREVER
      songs_today: Need ASAP for real-time scoring; expires quickly. MAX_VERSIONS=INFINITE TTL="1 DAY" CACHED
      songs_prev_year: Used for training ML algorithms in batch; keep forever. MAX_VERSIONS=INFINITE TTL=FOREVER COMPRESSED
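
To make the retention settings on slide 24 concrete, here is a small Python sketch of how `MAX_VERSIONS` and `TTL` might prune the cells of one column. The `LocalityGroup` class and `retained` function are hypothetical, and the exact boundary behaviour (a cell expires once its age reaches the TTL) is an assumption, not something the slides specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalityGroup:
    """Hypothetical model of the per-group settings shown on the slide."""
    name: str
    max_versions: Optional[int]  # None stands for INFINITE
    ttl_seconds: Optional[int]   # None stands for FOREVER


def retained(cells, group, now):
    """cells is a list of (timestamp, value); return survivors, newest first."""
    live = sorted(cells, key=lambda cell: cell[0], reverse=True)
    if group.ttl_seconds is not None:
        # Assumed rule: drop cells whose age has reached the TTL.
        live = [cell for cell in live if now - cell[0] < group.ttl_seconds]
    if group.max_versions is not None:
        live = live[: group.max_versions]
    return live


INFO = LocalityGroup("info", max_versions=1, ttl_seconds=None)
SONGS_TODAY = LocalityGroup("songs_today", max_versions=None, ttl_seconds=24 * 60 * 60)
SONGS_PREV_YEAR = LocalityGroup("songs_prev_year", max_versions=None, ttl_seconds=None)

now = 1396560123
plays = [(now - 10, "a"), (now - 3600, "b"), (now - 2 * 86400, "c")]
```

With these made-up cells, `info` keeps only the newest one, `songs_today` drops the two-day-old one, and `songs_prev_year` keeps all three.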
  25. KijiSchema • Similar to Cassandra, HBase, BigTable • Originally based on HBase ➔ timestamped versions • Logical and physical organization are separate • Complex data types
  26. Kiji on Cassandra
  27. Locality groups ➔ Tables. Locality group ~ query. [Diagram: entity ID | info | songs_today | songs_prev_year]
      CREATE TABLE loc_grp...
  28. Entity ID ➔ Primary key. [Diagram: the “bob” row]
      CREATE TABLE loc_grp (
        userid bigint,
        user text,
        PRIMARY KEY (userid, user)
      ) WITH CLUSTERING ORDER BY (user ASC);
  29. Family, Qualifier, Version ➔ Clustering Columns. [Diagram: the “bob” row]
      CREATE TABLE loc_grp (
        userid bigint,
        user text,
        family text,
        qualifier text,
        version bigint,
        PRIMARY KEY (userid, user, family, qualifier, version)
      ) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
  30. Column values ➔ Blobs. [Diagram: the “bob” row]
      CREATE TABLE loc_grp (
        userid bigint,
        user text,
        family text,
        qualifier text,
        version bigint,
        value blob,
        PRIMARY KEY (userid, user, family, qualifier, version)
      ) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
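
The mapping on slides 28 to 30 amounts to flattening one Kiji row into many CQL rows. The sketch below shows that flattening in plain Python, with no Cassandra driver involved; `to_cql_rows` is a hypothetical helper, and the cell values are placeholder bytes rather than real serialized Avro.

```python
def to_cql_rows(userid, user, cells):
    """Flatten {(family, qualifier): {version: value}} into CQL-style rows.

    Each output tuple mirrors the columns of the table on slide 30:
    (userid, user, family, qualifier, version, value).
    """
    rows = [
        (userid, user, family, qualifier, version, value)
        for (family, qualifier), versions in cells.items()
        for version, value in versions.items()
    ]
    # Mimic CLUSTERING ORDER BY (family ASC, qualifier ASC, version DESC).
    rows.sort(key=lambda r: (r[2], r[3], -r[4]))
    return rows


cells = {
    ("songs", "let it be"): {30: b"\x01", 50: b"\x02", 40: b"\x03"},
    ("songs", "help"): {20: b"\x04"},
    ("info", "email"): {10: b"\x05"},
}
cql_rows = to_cql_rows(0xFA, "bob", cells)
```

Every (family, qualifier, version) triple becomes its own row, and the newest version of each column sorts first within that column.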
  31. Distinct Kiji column ➔ CQL row
      cqlsh:kiji_music> SELECT * FROM kiji_table_users;
       userid | user | family | qualifier      | timestamp | value
      --------+------+--------+----------------+-----------+---------------
       123456 | bob  | songs  | abbey road     | 139656012 | 0x81274b31032
       123456 | bob  | songs  | help           | 139625013 | 0x7c13270f129
       123456 | bob  | songs  | help           | 139621359 | 0x2307ff10370
       123456 | bob  | songs  | help           | 139625013 | 0x45e1822a497
       123456 | bob  | songs  | helter skelter | 139621324 | 0x104bb974c34
  32. Physical organization of data on disk. Efficient queries = continuous scans! [Diagram: the “bob” row]
      0xfa:bob:info:email:t0:bob@gmail.com
      0xfa:bob:info:payment:t1:AMEX1234...
      0xfa:bob:songs:let it be:t5:...
      0xfa:bob:songs:let it be:t4:…
      0xfa:bob:songs:let it be:t2:…
      0xfa:bob:songs:help:t2:…
      0xfa:bob:songs:helter skelter:t1:…
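
Why continuous scans matter can be seen with a sorted list of keys. This Python sketch is an illustration only: it assumes keys sort lexically by (userid, user, family, qualifier) and then by version descending, which may differ in detail from the order drawn on the slide, and the extra “alice” key is made up to show a partition boundary.

```python
from bisect import bisect_left


def key(userid, user, family, qualifier, version):
    # Negating the version makes an ascending sort yield newest-first.
    return (userid, user, family, qualifier, -version)


KEYS = sorted([
    key(0xFA, "bob", "info", "email", 0),
    key(0xFA, "bob", "info", "payment", 1),
    key(0xFA, "bob", "songs", "let it be", 5),
    key(0xFA, "bob", "songs", "let it be", 4),
    key(0xFA, "bob", "songs", "let it be", 2),
    key(0xFA, "bob", "songs", "help", 2),
    key(0xFA, "bob", "songs", "helter skelter", 1),
    key(0xFB, "alice", "info", "email", 3),
])


def prefix_scan(sorted_keys, prefix):
    """Return the single contiguous slice of keys that start with prefix."""
    lo = bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi][: len(prefix)] == prefix:
        hi += 1
    return sorted_keys[lo:hi]


# Everything in the "songs" family for bob is one unbroken run of keys.
songs = prefix_scan(KEYS, (0xFA, "bob", "songs"))

# A version range across all qualifiers is not: matches are scattered.
scattered = [
    i for i, k in enumerate(KEYS)
    if k[:3] == (0xFA, "bob", "songs") and 2 <= -k[4] <= 4
]
```

A key prefix selects one run of adjacent keys, while a version range that skips the qualifier selects keys with gaps between them, so it cannot be served by a single scan.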
  33. Kiji queries ➔ CQL queries. Kiji queries can be complicated...
  34. Kiji queries ➔ CQL queries. All data in “info” column family for “bob” ➔ [Diagram: the “bob” row]
      SELECT qualifier, value FROM loc_grp_info
        WHERE userid=0xfa AND user='bob' AND family='info' LIMIT 1;
  35. Kiji queries ➔ CQL queries. Data in “info:email” and last play of “help” for “bob” ➔ [Diagram: the “bob” row]
      SELECT value FROM lg_music
        WHERE userid=0xfa AND user='bob' AND family='info' AND qualifier='email';
      SELECT value FROM lg_music
        WHERE userid=0xfa AND user='bob' AND family='songs' AND qualifier='help' LIMIT 1;
  36. Kiji queries ➔ CQL queries. All songs played by “bob” on April 2nd ➔ [Diagram: the “bob” row]
      SELECT qualifier, value FROM lg_music
        WHERE userid=0xfa AND user='bob' AND family='songs'
        AND timestamp >= 1396396800 AND timestamp <= 1396483200 ALLOW FILTERING; 😱😱
  37. Kiji queries ➔ CQL queries. [Diagram: the “bob” row]
      Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)
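
The error on slide 37 follows from a rule that can be modelled in a few lines. The Python sketch below is a simplified reading of that error message, not Cassandra's actual validation code: it assumes a clustering column may be restricted only if every preceding clustering column is restricted by equality, and it ignores IN clauses, secondary indexes, and any behaviour that differs in other Cassandra versions. The function name is hypothetical.

```python
CLUSTERING = ["user", "family", "qualifier", "timestamp"]


def first_rejected_column(restrictions, clustering=CLUSTERING):
    """restrictions maps column name to "eq" or "range".

    Return the first clustering column whose restriction breaks the
    assumed rule, or None if the restrictions form a valid key prefix.
    """
    prefix_is_eq = True  # True while every earlier column has an EQ restriction
    for column in clustering:
        kind = restrictions.get(column)
        if kind is None:
            prefix_is_eq = False
        elif not prefix_is_eq:
            return column
        elif kind == "range":
            prefix_is_eq = False
    return None


# Slide 36: qualifier is unrestricted, so the timestamp range is rejected.
slide_36 = {"user": "eq", "family": "eq", "timestamp": "range"}

# Fixing the qualifier with EQ makes the same timestamp range acceptable.
per_qualifier = {"user": "eq", "family": "eq", "qualifier": "eq", "timestamp": "range"}
```

Under this model the slide 36 query fails on `timestamp`, matching the column named in the error message.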
  38. Tricky queries • Filter in CQL where possible • Break up into multiple CQL queries • Filter on the client • Designing table layout is important
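
Two of the strategies on slide 38 can be compared on the “songs played on April 2nd” example. This Python sketch simulates the table with an in-memory list; the functions are hypothetical stand-ins for CQL queries, the rows are made up, and nothing here talks to Cassandra.

```python
# Rows are (userid, user, family, qualifier, timestamp, value).
TABLE = [
    (0xFA, "bob", "info", "email", 1396000000, b""),
    (0xFA, "bob", "songs", "help", 1396400000, b""),
    (0xFA, "bob", "songs", "help", 1396300000, b""),
    (0xFA, "bob", "songs", "helter skelter", 1396560123, b""),
    (0xFA, "bob", "songs", "let it be", 1396450000, b""),
]
START, END = 1396396800, 1396483200


def filter_on_client(table, userid, user, start, end):
    """One valid prefix query on the family, then filter timestamps locally."""
    fetched = [r for r in table if r[:3] == (userid, user, "songs")]
    return [(r[3], r[4]) for r in fetched if start <= r[4] <= end]


def one_query_per_qualifier(table, userid, user, qualifiers, start, end):
    """Several valid queries: EQ on qualifier plus a range on timestamp."""
    results = []
    for qualifier in qualifiers:
        prefix = (userid, user, "songs", qualifier)
        results += [
            (r[3], r[4]) for r in table
            if r[:4] == prefix and start <= r[4] <= end
        ]
    return results


client_side = filter_on_client(TABLE, 0xFA, "bob", START, END)
split_up = one_query_per_qualifier(
    TABLE, 0xFA, "bob", ["help", "helter skelter", "let it be"], START, END
)
```

Both approaches return the same plays. The first moves more data to the client, while the second needs the list of qualifiers up front, which is one reason the table layout matters.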
  39. MapReduce • New InputFormat, OutputFormat • Java driver • Hadoop 2.x • Multiple C* queries per RecordReader
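
One plausible reading of “multiple C* queries per RecordReader” is that a reader issues one query per locality-group table and stitches the results back into whole Kiji rows. The slides do not spell this out, so the Python sketch below is an assumption throughout: `merged_rows` is a hypothetical helper, and it assumes every result set arrives sorted by entity ID in the same order.

```python
import heapq
from itertools import groupby


def merged_rows(*result_sets):
    """Merge per-table results into (entity_id, cells) pairs.

    Each result set is an iterable of
    (entity_id, family, qualifier, version, value), sorted by entity_id.
    """
    merged = heapq.merge(*result_sets, key=lambda r: r[0])
    for entity_id, cells in groupby(merged, key=lambda r: r[0]):
        yield entity_id, [cell[1:] for cell in cells]


# Made-up results from two locality-group tables.
info_results = [
    ((1, "alice"), "info", "email", 1, b"a"),
    ((2, "bob"), "info", "email", 1, b"b"),
]
songs_results = [
    ((2, "bob"), "songs", "help", 2, b"x"),
]
rows = list(merged_rows(info_results, songs_results))
```

Because the merge is lazy, a reader built this way would hold only one entity's cells in memory at a time.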
  40. Project status
  41. Initial release in ~2 weeks. www.kiji.org/getstarted
  42. Next quarter • Cassandra in all Kiji components • Expose Cassandra-specific features • Kiji support in CQLSH
  43. Thanks to Cassandra community • Great help on mailing lists for users, dev, java driver • Webinars, meetups, C* Summit all available online • Free training from DataStax • Very easy to get up-to-speed • Thanks to hosts and organizers today
  44. Try it now: Kiji Bento Box • Latest compatible versions of all components • Hadoop, ZooKeeper, HBase • Cassandra in ~2 weeks. www.kiji.org/getstarted http://jobs.wibidata.com/
