Cassandra Day SV 2014: Building a Flexible, Real-time Big Data Applications Platform on Apache Cassandra with Kiji

The talk is about the process of adding support for Cassandra in Kiji, our open-source platform for building big-data applications. I start off by describing the Kiji project and how it enables folks to build big-data applications, and (hopefully) getting everyone excited about it. Then I talk about the Kiji data model, its origins in HBase (we initially built Kiji on top of HBase), how we updated it to also support Cassandra, and what some of the issues were. I get into some detail about our use of the Java driver and its async API, how we translate operations in Kiji into CQL statements, and some enhancements we've made to the Hadoop InputFormat and OutputFormat. I think this talk will be interesting to folks in general, and in particular will be useful for anyone who has an HBase background and is now working with Cassandra.

The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications, including the following:
Support for evolvable schemas of complex data types
Batch training of machine learning models with Hadoop
Real-time scoring with trained models
Integration with Hive and R
A REST endpoint

Recently, we have updated Kiji to use Cassandra as a backing data store (previously, Kiji worked only with HBase). In this talk, we describe the process of integrating Cassandra and Kiji. Topics we cover include the following:
The Kiji architecture and data model
Implementing the Kiji data model in Cassandra using the Java driver and CQL3
Integrating Cassandra with Hadoop 2.x
Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)
Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users
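
As a rough illustration of what "translating a Kiji operation into a CQL statement" can look like, the Python sketch below builds a parameterized SELECT for a single column-family or column read. It assumes the table layout shown in the slides (columns userid, username, family, qualifier, timestamp, value, with timestamp clustered in descending order). The function name `kiji_get_to_cql` and its parameters are hypothetical; this is not Kiji's code, which is built on the Java driver.

```python
from typing import List, Optional, Tuple


def kiji_get_to_cql(
    table: str,
    userid: int,
    username: str,
    family: str,
    qualifier: Optional[str] = None,
    max_versions: Optional[int] = None,
) -> Tuple[str, List[object]]:
    """Build a parameterized CQL SELECT for one Kiji-style read (illustrative only)."""
    # The entity ID components and the family are always restricted by equality.
    clauses = ["userid = ?", "username = ?", "family = ?"]
    params: List[object] = [userid, username, family]

    # Restricting the qualifier narrows the read to a single column.
    if qualifier is not None:
        clauses.append("qualifier = ?")
        params.append(qualifier)

    cql = (
        f"SELECT qualifier, timestamp, value FROM {table} WHERE "
        + " AND ".join(clauses)
    )

    if max_versions is not None:
        # With timestamp clustered DESC, LIMIT n returns the n newest versions,
        # but only when a single column (qualifier) is selected.
        if qualifier is None:
            raise ValueError("max_versions requires a qualifier in this sketch")
        cql += f" LIMIT {int(max_versions)}"

    return cql, params
```

Calling `kiji_get_to_cql("music", 0xfa, "bob", "songs", qualifier="help", max_versions=1)` produces a statement equivalent to the "last play of help" query in the slides, with the values passed separately as bind parameters.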

Published in: Technology, Business

Transcript of "Cassandra Day SV 2014: Building a Flexible, Real-time Big Data Applications Platform on Apache Cassandra with Kiji"

  1. Building a Flexible, Real-time Big Data Applications Platform on Cassandra with Kiji. Cassandra Day Silicon Valley, 07 April 2014. Clint Kelly, Member of Technical Staff, WibiData
  2. Overview • The Kiji Project • The Kiji data model and KijiSchema • Mapping Kiji to Cassandra • Status and future work • Try it now!
  3. The Kiji Project
  4. Want to build this...
  5. Have this... Want to build this...
  6. Have this... Want to build this...
  7. Open source components • Batch processing – Extract, transform, load – Train machine learning models • Scalable storage – Time-series data • Serialization – Complex data types [Component stack diagram: Hadoop, C*, HBase, Avro; KijiSchema; KijiMR, KijiREST, KijiHive, KijiScoring, KijiExpress]
  8. KijiSchema • Schemas and data serialization • Complex, atomic data types • Schema evolution • Table metadata. Example record: record UserLog { long timestamp; int user_id; string url; long session_id; }
  9. Kiji batch components • Scala DSL ➔ describe MapReduce computations • Machine learning library • Hive adapter
  10. Kiji real-time components • REST server • Scoring server
  11. Kiji Summary • Bridge between open-source technologies and real-time, big data applications • Users are building real systems with Kiji now! – Personalized recommendation systems for retail – Energy usage and analytics reporting
  12. The Kiji data model and KijiSchema
  13. Tables are composed of rows.
  14. We call row keys “entity IDs.” [Row diagram: entity ID | data]
  15. We support composite entity IDs (with hashed and unhashed components). [Row diagram: 0xfa, “bob” | data]
  16. Data in rows is organized into “column families.” [Row diagram: 0xfa, “bob” | info | songs]
  17. Column families contain columns, named as “family:qualifier.” [Row diagram: 0xfa, “bob” | info:email, info:payment | songs:let it be, songs:help, songs:helter skelter]
  18. Individual columns can have many different timestamped versions. [Row diagram as on slide 17, with several stacked versions of songs:let it be, one labeled 1396560123]
  19. Data values can be complex records: record SongPlay { long song_id; int user_rating; long session_id; device_type device; }
  20. Locality groups: Separate logical organization of data (column families) from physical attributes (caching, compression, etc.) [Row diagram: entity ID | info | songs_today | songs_prev_year]
  21. Locality groups: Separate logical organization of data (column families) from physical attributes (caching, compression, etc.) Need this data ASAP for real-time scoring. Use this data only for batch jobs.
  22. Locality groups: Always refer to columns by logical name (“family:qualifier”). “real_time” (in-memory, uncompressed, TTL = 1 day): Need this data ASAP for real-time scoring. “batch” (compressed, TTL = 12mo): Use this data only for batch jobs.
  23. KijiSchema summary • Data model similar to Cassandra, HBase, BigTable • Contains time dimension (not present in C*) • Logical and physical organization separate • Complex schemas with Avro
  24. Mapping Kiji to Cassandra
  25. Implementation notes • Built for Cassandra 2.0.6+ • Native protocol / Java driver (no Thrift) • Asynchronous API • Assume users have Hadoop, ZooKeeper
  26. Mapping a Kiji table ➔ Cassandra • Locality group ➔ Table • Entity ID ➔ Primary key – Hashed components ➔ partition key – Unhashed components ➔ clustering columns • Family, qualifier, timestamp ➔ clustering columns • Cell values ➔ blobs
  27. CQL for Kiji locality group
      CREATE TABLE users_locality_group_fast (
        userid bigint,
        username text,
        family text,
        qualifier text,
        timestamp bigint,
        value blob,
        PRIMARY KEY (userid, username, family, qualifier, timestamp)
      ) WITH CLUSTERING ORDER BY (username ASC, family ASC, qualifier ASC, timestamp DESC);
  28. cqlsh:kiji_music> SELECT * FROM kiji_table_users;
       userid | username | family | qualifier      | timestamp | value
      --------+----------+--------+----------------+-----------+---------------
       0xfa   | bob      | info   | email          | 139653249 | 1243970104327
       0xfa   | bob      | songs  | abbey road     | 139656012 | 0981274331032
       0xfa   | bob      | songs  | help           | 139625013 | 9074132704129
       0xfa   | bob      | songs  | help           | 139621359 | 1923079210370
       0xfa   | bob      | songs  | help           | 139625013 | 4745018223497
       0xfa   | bob      | songs  | helter skelter | 139621324 | 7710423974234
  29. Physical organization of data on disk. Efficient queries = continuous scans!
      0xfa:bob:info:email:t0:bob@gmail.com
      0xfa:bob:info:payment:t1:AMEX1234...
      0xfa:bob:songs:let it be:t5:...
      0xfa:bob:songs:let it be:t4:…
      0xfa:bob:songs:let it be:t2:…
      0xfa:bob:songs:help:t2:…
      0xfa:bob:songs:helter skelter:t1:…
  30. Kiji queries ➔ CQL queries. All data in “info” column family for “bob” ➔
      SELECT qualifier, value FROM music WHERE userid=0xfa AND username='bob' AND family='info';
  31. Kiji queries ➔ CQL queries. Data in “info:email” and last play of “help” for “bob” ➔
      SELECT value FROM music WHERE userid=0xfa AND username='bob' AND family='info' AND qualifier='email';
      SELECT value FROM music WHERE userid=0xfa AND username='bob' AND family='songs' AND qualifier='help' LIMIT 1;
  32. Kiji queries ➔ CQL queries. All songs played by “bob” on April 2nd ➔ 😱😱
      SELECT qualifier, value FROM music WHERE userid=0xfa AND username='bob' AND family='songs' AND timestamp >= 1396396800 AND timestamp <= 1396483200 ALLOW FILTERING;
  33. Kiji queries ➔ CQL queries. Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)
  34. Queries that do not map well to CQL • Break up into multiple CQL queries – Hooray for Session#executeAsync! • Filter on the client – Potentially very expensive, but functional – Provide warning to user • Educate users about table layout – Layout in previous example is terrible for that query • Most issues related to “time” dimension
  35. MapReduce • Wrote new InputFormat, OutputFormat • Hadoop 2.x • Multiple C* queries per RecordReader • Does not use Thrift
  36. Project status and next steps
  37. Initial release in ~ 2 weeks • Cassandra as part of the Bento Box • Cassandra working in KijiSchema, KijiMR
  38. Support in the coming months • Cassandra integration with KijiREST, KijiScoring, KijiExpress, etc. • Expose Cassandra-specific features to users – Variable consistency levels – Load-balancing policies – Diagnostics (e.g., route tracing) • Kiji support in CQLSH – Decode Avro values
  39. Thanks to Cassandra community • Great help on mailing lists for users, dev, java driver • Webinars, meetups, C* Summit all available online • Free training from DataStax • Very easy to get up-to-speed
  40. Try it now -- Kiji Bento Box • Latest compatible versions of all components • Hadoop, ZooKeeper, HBase • Cassandra in ~2 weeks. www.kiji.org/getstarted
  41. KijiSchema • Schemas and data serialization • Complex data types (e.g., nested maps) • Schema evolution • Metadata • Composite row keys • Transparent paging • Data-definition language, REPL
  42. Schema support: Support for complex schemas with Avro: record UserLog { long timestamp; int user_id; string url; } KijiSchema allows schema versioning
  43. Column name translation • “family:qualifier” -> “A:B” • Saves disk space • Improves performance • User-facing tools translate names • Possible to turn this off
  44. Kiji queries ➔ CQL queries. All data in family “songs” for user “bob” ➔
      SELECT qualifier, value FROM music WHERE userid=0xfa AND username='bob' AND family='songs';
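
Slides 29 and 32 to 34 turn on one point: cells within a row are stored in clustering order (family, qualifier, then timestamp descending), so a query is cheap only when its matches form one contiguous run. The Python sketch below models a single row in memory to show why a time range across a whole family is not contiguous, and what the two workarounds from slide 34 (one query per qualifier, or filtering on the client) amount to. All names and sample values are hypothetical; no Cassandra or Kiji API is used, and in the talk the per-qualifier queries would be issued with Session#executeAsync.

```python
from typing import List, Sequence, Tuple

# (family, qualifier, timestamp, value): the clustering columns after the entity ID.
Cell = Tuple[str, str, int, str]


def clustering_order(cells: Sequence[Cell]) -> List[Cell]:
    """Sort cells as the table does: family ASC, qualifier ASC, timestamp DESC."""
    return sorted(cells, key=lambda c: (c[0], c[1], -c[2]))


def matching_positions(cells: Sequence[Cell], family: str, t_min: int, t_max: int) -> List[int]:
    """Positions (in stored order) of cells in `family` with t_min <= timestamp <= t_max."""
    return [
        i for i, (fam, _qual, ts, _val) in enumerate(cells)
        if fam == family and t_min <= ts <= t_max
    ]


def is_contiguous(positions: Sequence[int]) -> bool:
    """True when the positions form one unbroken run, i.e. a single scan suffices."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


def time_range_per_qualifier(
    cells: Sequence[Cell], family: str, qualifiers: Sequence[str], t_min: int, t_max: int
) -> List[Cell]:
    """Workaround 1: one range read per qualifier; each read is contiguous."""
    out: List[Cell] = []
    for qual in qualifiers:
        out.extend(
            c for c in cells
            if c[0] == family and c[1] == qual and t_min <= c[2] <= t_max
        )
    return out


def time_range_client_filter(cells: Sequence[Cell], family: str, t_min: int, t_max: int) -> List[Cell]:
    """Workaround 2: read the whole family (contiguous), then filter by time locally."""
    family_cells = [c for c in cells if c[0] == family]
    return [c for c in family_cells if t_min <= c[2] <= t_max]


# Made-up sample row for "bob"; the timestamps are small integers for readability.
ROW = clustering_order([
    ("songs", "let it be", 110, "play-e"),
    ("songs", "help", 300, "play-b"),
    ("info", "email", 100, "bob@example.com"),
    ("songs", "let it be", 310, "play-d"),
    ("songs", "help", 120, "play-c"),
])
```

With this sample row, `is_contiguous(matching_positions(ROW, "songs", 250, 350))` is false because the two matching plays are separated by an older version of "help", while both workaround functions return the same two cells. The client-side filter reads every cell in the family first, which is the cost slide 34 warns about.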