Instant Cassandra Query Language - Good Book - Short time Learning


  1. Table of Contents
     Instant Cassandra Query Language
       Credits
       About the Author
       About the Reviewers
       Support files, eBooks, discount offers and more
         Why Subscribe?
         Free Access for Packt account holders
       1. Instant Cassandra Query Language
         So, what is CQL?
         Where will you use CQL?
         Installation
           Step 1 – what do I need?
           Step 2 – download Cassandra
           Step 3 – install Cassandra
           Step 4 – configure Cassandra
           Step 5 – start Cassandra server
           What may go wrong?
           And that's it
         Quick start – running CQL queries
           Step 1 – start CQL shell (cqlsh)
             On Unix
             On Windows
               Install and configure Python
               Run cqlsh
           Step 2 – create and use a keyspace
           Step 3 – describe and list keyspace
           Step 4 – create table and insert records
           Step 5 – display records
             How is data actually stored?
           Step 6 – deleting data
             DELETE
             TRUNCATE
             USING TTL
             Dropping TABLE and KEYSPACE
             CQL versions
           Step 7 – building facebook
             Creating keyspace for facebook
             Storing facebook data using tables
             Storing user profiles
               Create table
               Insert records
               Update row
               View rows
               Alter table (add collections columns)
             Storing facebook posts
               Create table
               Add records
             Creating table for user wall (timeline)
               How data is stored
             Connecting users with their friends
             Counting friends and posts
         Top 3 features you need to know about
           TTL (time to live)
             Syntax
             Gone in 60 seconds!
           CQL for client applications
             Using the Thrift API
             Using the Datastax Java driver
             Using high-level clients
           Batch operations
             Syntax
             Three flavors of the batch operation
         People and places you should get to know
           Official sites
           Articles and tutorials
           Community
           Blogs
           Twitter
  3. 3. Instant Cassandra Query Language
  4. 4. Instant Cassandra Query Language Copyright © 2013 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: September 2013 Production Reference: 1230913 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78328-271-5
  5. 5. Credits Author Amresh Singh Reviewers Theo Hultberg Courtney Robinson Acquisition Editors Sam Birch Andrew Duckworth Commissioning Editor Yogesh Dalvi Technical Editors Krishnaveni Haridas Aman Preet Singh Ankita Thakur Project Coordinator Romal Karani Proofreader Clyde Jenkins Graphics Yuvraj Mannari Production Coordinator Adonia Jones Cover Work Adonia Jones
  6. 6. Cover Image Sheetal Aute
  7. 7. About the Author Amresh Singh is an engineering graduate and has more than nine years of software development experience in Java SE/EE technologies. He has vast industry experience in server-side web development using Struts, JSF, Spring, Hibernate, RDBMS, and so on. He is a core committer on the open source project Kundera, which is a JPA 2.0-based object-data store mapping library for NoSQL databases, such as Cassandra, HBase, MongoDB, and OracleNoSQL. He has worked with reputed development and consulting IT organizations in business domains ranging from telecom to finance and insurance. He is currently working as a Lead Developer in the Innovation Labs of Impetus Infotech (India) Pvt. Ltd. He has undertaken speaking engagements at Cloud Camp. He loves writing, and is a DZone Most Valuable Blogger. You can follow his blog at I would like to thank my parents (Shanti, Nand), my family members (Rachana, Rashmi, Vijay, Namrata, Aryaj), and all those who have persistently and patiently shown faith in me, and have been a continuous source of encouragement and support to me. Many thanks to my team members, managers, trainers, and mentors, who supported and helped me grow professionally throughout my career. This book wouldn't have been possible without the hard work and commitment of those who worked tirelessly with me from Packt Publishing. Big thanks to Mohammand Rizvi, Yogesh Dalvi, Romal Karani, and their entire team at Packt. Lastly, the friends I found on my life's journey have been a valuable asset to me. Thank you guys!
  8. 8. About the Reviewers Theo Hultberg is the Chief Architect at Burt, where he designs and builds distributed systems for analytics and big data processing. He's an avid open source contributor, and is the author of cql-rb, the CQL3 driver for Ruby. Courtney Robinson has more than six years of experience, with more than two years in the commercial field and the remaining four as a freelancer. He's worked on various projects across multiple languages and platforms, including Java, PHP, JavaScript, C#, Scala, Haskell, Cassandra, and Hadoop. He's been a Cassandra user since 2008, not long after it was open sourced. He was a regular at the Cassandra London meetup, and has even given two talks there on CQL. He has a vested interest in the NoSQL space, and is currently a PhD student researching the applications and implementation of a distributed graph database with Cassandra's scalability and Neo4J's graph properties. He is now co-authoring the upcoming Netty in Action book from Manning Publications.
  9. 9. Support files, eBooks, discount offers and more You might want to visit for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <> for more details. At, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
  10. 10. Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. Why Subscribe? Fully searchable across every book published by Packt Copy and paste, print and bookmark content On demand and accessible via web browser
  11. 11. Free Access for Packt account holders If you have an account with Packt at, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
  12. 12. Chapter 1. Instant Cassandra Query Language Welcome to Instant Cassandra Query Language. This book has been especially designed for those who want to learn Cassandra's query language (CQL) in a simple, concise, and quick way. This includes CQL basics, its importance in Cassandra's future, step-by-step instructions, and several basic and advanced tips and tricks on indexing, batch operations, the binary protocol, and so on. This document contains the following sections: So, what is CQL? helps you to know what CQL is, how it evolved, why it is needed, and what lies in its future. Installation helps you install the Cassandra server on your machine, letting you run CQL queries from the cqlsh console. Quick start – running CQL queries helps you get started quickly by running basic CQL commands and seeing results immediately on the console. We'll build a small example application and explain the CQL syntax, keywords, and constructs used. Top 3 features you need to know about lists CQL features ranging from the ones that you would be using frequently to those that are advanced and would help you stay ahead of the curve. People and places you should get to know lists useful links to tutorials, presentations, articles, and blogs, as well as official sites, discussion forums, and a Twitter handle you may want to follow in order to keep yourself up-to-date. So, what is CQL? Cassandra Query Language, or CQL for short (pronounced as si-kw-l), is a structured query language for the Apache Cassandra database (we assume you already have an idea of what Cassandra is). The Thrift RPC API and Avro are the other two popular ways of interacting with Cassandra. However, both of these libraries have been unable to keep pace with the rapid changes in the Cassandra world with each newer version. CQL is very much like SQL (well, more or less, except where it had to deviate).
It hides the complexities of Cassandra's internal storage structure from users and provides an SQL-like, user-friendly interface, enabling us to focus on our business applications. CQL has been available in Cassandra since version 0.8.0; it has matured a lot, and is still evolving and changing fast, but is production-ready. This book uses version 3.x of CQL, which is incompatible with 2.x (and 1.x). CQL 3 is
  13. 13. available with Cassandra 1.2.x or newer. Where will you use CQL? We will use CQL as a query language on Cassandra's CQL console, or cqlsh, which is a replacement for Cassandra's old command-line interface, Cassandra-cli. We will also use it for writing client applications against the Cassandra binary protocol using a compatible driver. This protocol is custom-made for Cassandra as a standard for clients to interact with the database using CQL. Note In this book, we'll use the Cassandra binary protocol for writing applications, but there is also an older Thrift protocol that you might have used in the past. If you're unfamiliar with Thrift, relax! It's not required for learning CQL. In a nutshell, Thrift is a low-level RPC-based API for interacting with distributed databases and has been widely used with Cassandra. Its use is going to fade away because of its user-unfriendliness, its low-level nature, and the difficulty of evolving with rapid changes in Cassandra.
  14. 14. Installation In order to use and run CQL queries, you need to install Apache Cassandra. You can install, configure, and get it running on your machine in five easy steps. Step 1 – what do I need? In real-world systems, Cassandra would be installed on multiple nodes with much better hardware. However, for running the examples explained in this book, we will install it on a single node. The following are the prerequisites: Java version 1.6 or newer installed on your machine, preferably the Oracle JVM (OpenJDK, the IBM JVM, and JRockit would also work fine). Set the JAVA_HOME environment variable to the directory where Java is installed. Step 2 – download Cassandra The easiest way to download Cassandra is as a compressed package (apache-cassandra-x.y.z-bin.tar.gz) from the link It's always better to download and install the most current version of the package, which at the time of writing this book is 1.2.5. Step 3 – install Cassandra Uncompressing the tar package is all it takes to install Cassandra on your machine. Uncompress the Cassandra package you downloaded to a directory of your choice. Use the following command to perform this in Unix: tar -zxvf apache-cassandra-x.y.z-bin.tar.gz (On Windows, you would need a separate tool for uncompressing the tarball; 7-Zip and gzip would be at your rescue.) From now onward, we'll refer to the directory created after uncompressing this file as CASSANDRA_HOME. Step 4 – configure Cassandra Once you have unpacked the Cassandra installer, the folder structure will look something like the following screenshot:
  15. 15. A single-node setup requires minimum configuration. Configuration files in Cassandra are kept in the conf folder under CASSANDRA_HOME. Open conf/cassandra.yaml and check the values for the following keys: saved_caches_directory data_file_directories commitlog_directory Make sure that these keys are pointing to the correct directories, and that the current user has read/write permissions on them. Step 5 – start Cassandra server Just change directory to the bin folder under CASSANDRA_HOME, and run the Cassandra server startup script using the following commands: cd CASSANDRA_HOME/bin ./cassandra Or, if you want Cassandra to run in the foreground, run the script as follows: ./cassandra -f What may go wrong?
  16. 16. If you receive the error /var/log/cassandra/system.log (No such file or directory) while starting Cassandra, you'll have to create the cassandra folder manually under /var/log. If you receive a heap size related error (HeapDumpOnOutOfMemoryError) during Cassandra startup, adjust the heap size for Cassandra: On Linux-based systems, this may be done by editing conf/ and changing the Xss value to a higher one, like 180k in the following statement: JVM_OPTS="$JVM_OPTS -Xss180k" On Windows, you would need to edit bin/cassandra.bat and set the Xms variable correctly in JAVA_OPTS. And that's it By this point, you should have a working installation of Cassandra, and are free to play around and discover more about it, and yes, CQL.
  17. 17. Quick start – running CQL queries Welcome to the world of CQL! Now that you have a working installation of Cassandra ready, it's time to learn and practice CQL in a step-by-step manner. Let's open a CQL console and run a few basic queries in order to get a feel for it. Hello CQL First things first; let's see how to start the console, and then move on to running queries on it for reading/writing data. Step 1 – start CQL shell (cqlsh) cqlsh is a Python-based command-line client for Cassandra. It allows you to run CQL queries directly against the Cassandra database. On Unix On a Unix-flavored machine, you have virtually nothing to do in order to start cqlsh: CASSANDRA_HOME/bin>./cqlsh Note Prior to Cassandra 1.2, the default CQL version was 2.0, and we were required to provide the -3 switch for running CQL 3.0 queries. By default, Cassandra is started on localhost at port 9160. You can provide different values on the command line if you wish, similar to the following command: CASSANDRA_HOME/bin>./cqlsh <host> <port> On Windows Starting cqlsh on Windows requires installing Python, and then running cqlsh as an argument to it. Install and configure Python Python can be installed and configured on Windows in three easy steps: 1. Install Python from (choose the Windows x86 MSI Installer option) and add python.exe to your PATH environment variable. The latest version of Python may not work correctly due to incompatibilities.
  18. 18. 2. Install the Python thrift module from and run from its install location (the thrift-0.9.0 directory): python install 3. Install the Python CQL module by running the following from the CASSANDRA_HOME/pylib directory: python install Run cqlsh Go to the bin folder under CASSANDRA_HOME, and run Python with cqlsh as an argument using the following command: CASSANDRA_HOME\bin>python cqlsh Step 2 – create and use a keyspace A Keyspace in Cassandra is the equivalent of a database/schema in relational databases. While creating a keyspace, you need to specify replication settings: cqlsh> CREATE KEYSPACE testkeyspace WITH replication = {'class':'SimpleStrategy','replication_factor':1}; We'll be using SimpleStrategy to keep things simple, because our Cassandra setup is just a single node. In a production environment, where it's usually common to have multiple data centers, NetworkTopologyStrategy is generally used because it distributes data across data centers better. A replication factor of 1 means there will be a single copy of a row on a particular node. Higher replication factors are set up in real systems to create multiple replicas that ensure data availability in case of disk failures. To be able to work with tables, you need to use your keyspace, as shown in the following statement: cqlsh> USE testkeyspace; Another option is to prefix the table name with the keyspace name in all queries. Step 3 – describe and list keyspace At any time, you can DESCRIBE the keyspace; use the following command to do that: cqlsh> DESCRIBE KEYSPACE testkeyspace;
  19. 19. Note Unlike SELECT, CREATE, USE and others, DESCRIBE is not a command in Cassandra Query Language. Rather, it's a cqlsh feature that lets you view details of an object. If you wish to list all keyspaces present in the database, a Cassandra reserve keyspace named system comes in handy. It contains many system-defined tables for database objects definition and cluster configuration. Let's list all records from the schema_keyspaces table that contains records for each keyspace, using the following command: cqlsh> SELECT * FROM system.schema_keyspaces; The output of this command looks as follows: Step 4 – create table and insert records Run the following query on cqlsh console for creating a table called users. For those who are acquainted with SQL, the following syntax should look very familiar and almost identical. CREATE TABLE users( user_id varchar, age int, email varchar, city varchar,PRIMARY KEY (user_id)); While creating a table, you provide names of each column with their data type. In Cassandra, an INSERT operation is actually an "Upsert" (UPDATE or INSERT), which means columns are updated in case a row for the given primary key already exists, otherwise all columns are freshly inserted.
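The upsert behaviour described above can be simulated in a few lines of Python. This is a simplified sketch of the semantics only (no server connection, and names like upsert are our own, not a Cassandra API): an INSERT for an existing primary key overwrites only the columns it names and leaves the others intact.

```python
# Toy simulation of Cassandra's upsert semantics using a plain dict as the "table".
def upsert(table, primary_key, columns):
    row = table.setdefault(primary_key, {})  # create the row if the key is new
    row.update(columns)                      # overwrite/add only the named columns

users = {}
upsert(users, 'jsmith', {'age': 32, 'city': 'Dallas'})
upsert(users, 'jsmith', {'city': 'Austin'})  # a second "INSERT" with the same key
print(users['jsmith'])  # age survives; only city is replaced
```

Running the second upsert does not wipe the row: the untouched age column keeps its old value, just as in the CQL INSERTs above.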
  20. 20. INSERT INTO users(user_id, age, email, city)VALUES ('jsmith',32,'','Dallas'); INSERT INTO users(user_id, city) VALUES ('pscott','New Jersey'); INSERT INTO users(user_id, age, email) VALUES ('davidb',37,''); You can view a table's metadata using the DESCRIBE command, as shown in the following statement: cqlsh> DESCRIBE TABLE users; Note Keywords in CQL 3 are case-insensitive, which means INSERT, insert, Insert, and iNSerT are all the same and will work. Throughout this book, keywords are shown in capital letters in order to separate them from the other constructs of a CQL statement. All other identifiers, like table names, column names, and so on, are automatically converted to lowercase, unless they are put in double quotes. Step 5 – display records The SELECT command lists the contents of a table; a filter can be applied using the WHERE clause: cqlsh>SELECT * FROM users; cqlsh:testkeyspace> SELECT * FROM users WHERE user_id='jsmith'; How is data actually stored? The result of the previous SELECT queries may make you wonder—Oh! This looks so similar to RDBMS tables; is this how data is stored in Cassandra as well? The answer to the preceding question is—although CQL gives you a nice-looking interface
  21. 21. similar to the RDBMS you are used to, data is stored in Cassandra according to its own data model. A graphical representation of the data stored in Cassandra as a result of the preceding INSERT queries is shown in the following diagram: In Cassandra, a Keyspace is a container for your application data, and can be compared with a schema/database in an RDBMS. In keyspaces, you have Tables (formerly known as column families). Each table has multiple rows identified by a Primary Key (earlier called a row key). Cassandra is a schema-optional database. While a CREATE TABLE statement appears to suggest that a table has a fixed number of columns in each row, rows may actually have an arbitrary number of columns, as shown above. Unlike in an RDBMS, Cassandra tables are usually denormalized by data modellers to avoid runtime joins. This introduces redundancies in data, but gives better performance. Primary keys may have multiple components. Such keys are called Composite Keys, and their rows are stored and clustered differently. Composite keys are discussed later in the book. Step 6 – deleting data Data can be deleted using CQL in one of the following ways: DELETE: deletes column(s) or entire row(s) TRUNCATE: deletes all rows from the table USING TTL: sets a time to live on column(s) within a row; after the specified period expires, the columns are automatically deleted by Cassandra.
  22. 22. DELETE The DELETE command is used to delete column(s) or row(s). The following command deletes the email column for the primary key davidb: DELETE email FROM users WHERE user_id = 'davidb'; Don't specify any column after the DELETE keyword if you want to delete the entire row: DELETE FROM users WHERE user_id = 'pscott'; Below is the output of the above two deletions: SELECT * FROM users; Note If you've worked with Thrift before, you'll remember that rows get displayed on Cassandra-cli even though they have been deleted. Such rows are called "range ghosts". This happens because when a column or row is deleted from Cassandra (either through DELETE or through expiration of a TTL), it's not deleted from disk immediately. It is marked with what we call a "tombstone", which means it is eligible for deletion after the first compaction following the DELETE. CQL just ignores tombstoned rows and columns, so this is not a problem. TRUNCATE If you ever need to remove all records from a table, CQL has the following shorthand: TRUNCATE <Table name>; Try the following and see for yourself; none of the records previously inserted should be returned: TRUNCATE users; SELECT * FROM users;
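The tombstone behaviour from the preceding note can be sketched as a toy simulation in Python. All names here (TOMBSTONE, read, compact) are illustrative inventions, not part of any Cassandra API; the sketch only mirrors the described lifecycle: mark on delete, hide from reads, physically remove at compaction.

```python
TOMBSTONE = object()  # deletion marker, standing in for Cassandra's tombstone

def delete_column(row, col):
    row[col] = TOMBSTONE  # mark the column; nothing is removed from "disk" yet

def read(row):
    # CQL-level reads simply skip tombstoned columns
    return {k: v for k, v in row.items() if v is not TOMBSTONE}

def compact(row):
    # the first compaction after the delete physically drops tombstones
    for k in [k for k, v in row.items() if v is TOMBSTONE]:
        del row[k]

row = {'user_id': 'davidb', 'email': 'x', 'age': 37}
delete_column(row, 'email')
print(read(row))   # email is already invisible to reads
compact(row)
print(row)         # and physically gone after compaction
```

Between the delete and the compaction the data still occupies space on disk, which is exactly why range ghosts show up in low-level tools like Cassandra-cli but not in CQL results.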
  23. 23. USING TTL Another way to delete data in Cassandra is to specify an expiration time, named TTL (time to live), while inserting data. We'll explore this feature in the later section, Top 3 features you need to know about. It's too early to get into those things, right? Dropping TABLE and KEYSPACE Tables and keyspaces can be dropped by running: DROP TABLE users; DROP KEYSPACE testkeyspace; The DROP operation is complete, immediate, and recursive in nature. All data they hold is completely removed with immediate effect. CQL versions CQL versions follow an X.Y.Z naming convention, where: X represents the major version (major changes, usually backward incompatible) Y represents the minor version (changes that are backward compatible) Z represents the patch version (bug fixes) The following screenshot briefly explains the exciting journey of CQL version changes: Step 7 – building facebook Now that we've got firsthand experience of running some basic queries on cqlsh, it's time to explore more CQL constructs in depth. Following the Learn by Example principle, we'll design our data model for the kind of application Cassandra was originally written for, before it was donated to Apache. Yes, we'll be building a data model for a Facebook-like application.
  24. 24. Well, I wish we could build an exact clone that rivals this giant in 50 pages; but nevertheless, it will store user profiles and their social connections, their posts (only text ones, for simplicity), and a timeline. Creating keyspace for facebook We'll need only one keyspace, and all tables will be stored inside it. CREATE KEYSPACE facebook WITH replication = {'class':'SimpleStrategy','replication_factor':1}; USE facebook; DESCRIBE KEYSPACE facebook; Replication settings have been kept simple, because this is a single-node setup for a sample application.
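How SimpleStrategy and replication_factor interact can be sketched with a toy token ring in Python. This is a deliberately simplified model of replica placement (the node names and token values are made up, and the real partitioner hashes keys differently): the first replica goes to the first node clockwise on the ring, and further replicas to the next nodes in order.

```python
from bisect import bisect_left

# Hypothetical three-node ring: (token, node) pairs sorted by token.
ring = [(0, 'node-A'), (100, 'node-B'), (200, 'node-C')]

def replicas(token, replication_factor):
    """Place replication_factor copies on consecutive nodes clockwise."""
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)  # first node at/after the token
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

print(replicas(150, 1))  # replication_factor=1, as in our keyspace: one copy
print(replicas(150, 3))  # a higher factor spreads copies over more nodes
```

With replication_factor of 1, as in the facebook keyspace above, each row lives on exactly one node, which is fine for a sample application but offers no protection against disk failure.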
  25. 25. Storing facebook data using tables We'll be creating the following listed tables for storing data related to this application: UserProfiles Posts Timeline Friends UserCounter Storing user profiles Users' profile data would be stored in the UserProfiles table. Let's look at steps for creating this table and reading/writing data into it. Create table The following statement is a DDL statement for table creation. Here, email_id is specified as primary key. Because email addresses are universally unique, this serves as a good unique identifier for our purpose. CREATE TABLE UserProfiles(email_id text,password text,name text,age int,profile_picture blob,PRIMARY KEY(email_id)); Here text, int, and blob are data types corresponding to their respective column names. Blob is used to store binary data, like pictures, music, and video files. A complete list of CQL data types is worth examining, because we'll be using them throughout this book, so you better know them!
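As a rough illustration of how these CQL data types correspond to ordinary programming-language values, here is a small Python sketch. The 0xe8a0 blob literal is borrowed from the inserts used for this table; the exact wrapper types a real driver returns may differ, so treat the mapping as indicative only.

```python
# Indicative mapping of a few CQL literals to Python values.
text_val = 'Angela Thomas'          # CQL text -> str
int_val = 35                        # CQL int  -> int
blob_val = bytes.fromhex('e8a0')    # CQL blob literal 0xe8a0 -> raw bytes
print(type(blob_val).__name__, len(blob_val), blob_val.hex())
```

The key point for blob is that CQL's 0x... literal is just hex-encoded binary: two hex digits per byte, so 0xe8a0 is a two-byte value.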
  26. 26. Insert records Now that the table is created, let's insert a couple of records: INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','p4ssw0rd','John Smith',32,0x8e37); INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','guess1t','David Bergin',42,0xc9f1); INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','s3cr3t','Angela Thomas',35,0xe8a0); INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','p4hn4H1','Sarah Miller',24,0x5fa1); Update row Oops, Angela just realized that the profile picture she just uploaded doesn't look very good and wants to update it. She also decides to change her age to 20 (wish we could do that in real life). The following command shows how to do it: UPDATE UserProfiles SET age=20, profile_picture=0x5d1b WHERE email_id = ''; View rows In order to view the records we created, run the following SELECT queries: SELECT * FROM UserProfiles; SELECT * FROM UserProfiles WHERE email_id = ''; This results in the following output:
  27. 27. The preceding SELECT query worked because the WHERE clause contained a condition on the primary key column. Now imagine users of our application searching for other users by their name: SELECT * FROM UserProfiles WHERE name='John Smith'; Try running this, and you'll get this punch on your face, as shown in the following screenshot: The reason is that you need to create secondary indexes on non-key columns in order to run queries on them. Cassandra stores indexes as separate hidden tables. An index is created using the following statement: CREATE INDEX ON UserProfiles(name); And voila! This works like a charm. Note Created indexes have names in the following format: <Table name>_<Column name>_idx For example, the above command created an index with the name userprofiles_name_idx. All indexes, along with other table metadata, can be seen using the DESCRIBE TABLE command. You can create indexes on as many columns as you wish, and include multiple columns in the WHERE clause. For example:
  28. 28. CREATE INDEX ON UserProfiles(age); SELECT name, age FROM UserProfiles WHERE age > 30 AND name='John Smith' ALLOW FILTERING; One thing to note here is that queries detected as potentially expensive and returning non-contiguous rows (usually those involving conditional operators like >, >=, <, and <=) may throw the following exception: Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING In such queries, it is required to append ALLOW FILTERING. Alter table (add collections columns) Cassandra provides three collection data types, namely set, list, and map. Collections are useful for storing a collection of values in a single field, for example, tags for a blog, recipients of an email, a user's custom preferences, and so on.
Collection | Used for | Example
List | Storing values in insertion order, allowing duplicates. Returns elements according to their index in the list. | places_lived = ['Connecticut','New York','Washington']
Set | Storing unordered unique values. Returns elements in their natural ordering (alphabetical for text). | blog_tags = {'cql','cassandra','cql3.0'}
Map | Storing key-value pairs. Keys have to be unique. | activity = {'2013-07-01 18:07:35': 'Checked in to Restaurant', '2013-07-02 07:51:12': 'Liked Maria's post', '2013-07-02 13:43:41': 'Updated city'}
  29. 29. As shown in the table above, List values are represented in square brackets, Set values in curly braces, and Map key-value pairs, separated by colons, in curly braces. Let's add a List column to UserProfiles and update its data: ALTER TABLE UserProfiles ADD places_lived list<text>; Here, the string inside the angle brackets denotes the data type of the values that this list contains. Note Nesting of collections is not supported. For example, list<list<text>> or set<list<int>> are not allowed and will throw an exception. In the following commands, we are adding some values to the newly introduced list column: UPDATE UserProfiles SET places_lived = ['New York','Connecticut','London']WHERE email_id=''; UPDATE UserProfiles SET places_lived = ['New Delhi','Paris']WHERE email_id=''; UPDATE UserProfiles SET places_lived = ['Moscow']WHERE email_id=''; UPDATE UserProfiles SET places_lived = ['Perth','Berlin','Wellington','Dubai'] WHERE email_id=''; SELECT email_id,name, places_lived FROM UserProfiles;
  30. 30. Note Secondary indexes on elements inside collections are not possible as of now, but are likely to be supported in later versions.
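The semantics of the three CQL collection types map closely onto Python's built-in list, set, and dict, which makes them easy to reason about. A quick sketch using the example values from the table above (this only illustrates the container semantics, not Cassandra's storage of them):

```python
# CQL collection types behave much like their Python counterparts.
places_lived = ['New York', 'Connecticut', 'London']  # list: insertion order, duplicates allowed
blog_tags = {'cql', 'cassandra', 'cql3.0'}            # set: unique values, no positional order
activity = {                                          # map: unique keys -> values
    '2013-07-01 18:07:35': 'Checked in to Restaurant',
    '2013-07-02 07:51:12': "Liked Maria's post",
}

places_lived.append('London')  # a duplicate is fine in a list
blog_tags.add('cql')           # a no-op: the value is already in the set
print(len(places_lived), len(blog_tags), len(activity))
```

Note the asymmetry the table describes: appending a repeated value grows the list, while adding a repeated value leaves the set unchanged.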
  31. 31. Storing facebook posts Once users sign up in our Facebook application, the next thing they would want to do is post something they find interesting. The Posts table will store this data. Create table A table named Posts will store the contents posted by all users. At this point, it's not specified who posted what; we'll get to that later. We used the list data type in the UserProfiles table earlier; in Posts, we'll use the remaining collection types, set and map, for storing likes and comments, without which a post is useless. (For simplicity, let's assume a user comments on a post only once.) This is done using the following command: CREATE TABLE Posts (post_id timeuuid,body text, device text,posted_at timestamp,liked_by set<text>, commented_by map<text, text>,PRIMARY KEY(post_id)); Add records Let's add some records into this table using the following commands: INSERT INTO Posts(post_id, body, device, posted_at) VALUES (17cd4be0-eb1f-11e2-8f75-55116e6864a0, 'My first post','Web','2013-07-01 06:32:47'); INSERT INTO Posts(post_id, body, device, posted_at) VALUES (17d86f70-eb1f-11e2-8f75-55116e6864a0, 'My second post','iPhone','2013-07-03 13:16:31'); INSERT INTO Posts(post_id, body, device, posted_at) VALUES (18e99150-eb1f-11e2-8f75-55116e6864a0, 'My third post','Android','2013-07-03 11:45:10'); Note The post_id values in these insertions are standard type 1 timeuuid values. If you want to generate unique timeuuid values at the time the INSERT statement is run, the now() function comes in handy. For example: INSERT INTO Posts(post_id, body, device, posted_at) VALUES (now(),'My first post','Web','2013-07-01 06:32:47'); After insertion, it'll look similar to the following screenshot:
  32. 32. SELECT post_id,body,device FROM Posts; Updating rows with likes and comments is done using the following commands: UPDATE Posts SET liked_by= {'',''}WHERE post_id=17cd4be0-eb1f-11e2-8f75-55116e6864a0; UPDATE Posts SET liked_by={''}WHERE post_id=17d86f70-eb1f-11e2-8f75-55116e6864a0; UPDATE Posts SET commented_by={'':'awesome post!','':'really?'}WHERE post_id=17cd4be0-eb1f-11e2-8f75-55116e6864a0; UPDATE Posts SET commented_by={'':'Hahaha'} WHERE post_id=17d86f70-eb1f-11e2-8f75-55116e6864a0; SELECT body,liked_by FROM Posts; SELECT body, commented_by FROM Posts; Creating table for user wall (timeline) A user's wall, or Timeline, displays the full name and posts of the user and his/her friends. Had it been an RDBMS, we could have joined the UserProfiles and Posts tables at the time the Timeline page was to be displayed. In NoSQL databases like Cassandra, a join would be too costly, and we usually denormalize data that is queried frequently. Therefore, we will create a table dedicated to the Timeline:
  33. 33. CREATE TABLE Timeline(email_id text,post_id timeuuid, name text,body text,device text,posted_at timestamp, liked_by set<text>,commented_by map<text,text> ,PRIMARY KEY(email_id,post_id)); If you notice, the primary key is a combination of two columns instead of one. This is called a composite key. The first column (email_id) is called the Partition Key, while the remaining columns (post_id, just one, in this case) are Clustering Keys. All columns in a row that share the same partition key are physically stored on the same node. The composite key is a powerful feature of Cassandra. It offers many benefits, including data locality and performance, among others. Unlike an RDBMS, there is no trigger-like mechanism in Cassandra yet (though that's likely to be released with Cassandra 2.0). This means we'll have to insert records into Timeline ourselves while inserting into Posts. Let's do that using the following commands: INSERT INTO Timeline (email_id,post_id,name,body,device,posted_at,liked_by,commented_by) VALUES('',17cd4be0-eb1f-11e2-8f75-55116e6864a0,'Angela Thomas','My first post','Web','2013-07-01 06:32:47', {'',''}, {'':'awesome post!','':'really?'}); INSERT INTO Timeline (email_id,post_id,name,body,device,posted_at,liked_by,commented_by) VALUES('',17d86f70-eb1f-11e2-8f75-55116e6864a0,'Sarah Miller','My second post','iPhone','2013-07-03 13:16:31', {''},{'':'Hahaha'}); INSERT INTO Timeline (email_id,post_id,name,body,device,posted_at) VALUES('',18e99150-eb1f-11e2-8f75-55116e6864a0,'John Smith','My third post','Android','2013-07-03 11:45:10'); SELECT on both partition and clustering keys is possible (though you need to use ALLOW FILTERING when searching using clustering keys, due to reasons explained earlier). This is shown in the following commands, and the output in the following screenshot:
  34. SELECT email_id,post_id,name,body FROM Timeline; SELECT email_id,post_id,name,body FROM Timeline WHERE email_id=''; SELECT email_id,post_id,name,body FROM Timeline WHERE post_id=17d86f70-eb1f-11e2-8f75-55116e6864a0 ALLOW FILTERING; How data is stored Cassandra is called a partitioned row store. Data for each partition key is clustered by the columns in the clustering-keys list. All columns for a given partition key are guaranteed to be on the same physical machine. If the primary key has a single component, all columns in a row share a single partition key, and hence they are stored on a single machine. However, in the case of composite keys, rows with different partition keys are distributed across machines. A graphical representation of the data stored above is shown below:
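Incidentally, the timeuuid used for post_id is what makes this clustering chronological: a version-1 UUID embeds its creation time, which plain Java can extract. Here is a minimal sketch (the class and method names are ours, not from the book's code; the UUIDs are the sample post IDs used above):

```java
import java.util.UUID;

public class TimeuuidOrder {
    // Number of 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long GREGORIAN_OFFSET = 0x01b21dd213814000L;

    // Convert a version-1 (time-based) UUID's embedded timestamp to Unix milliseconds.
    public static long toUnixMillis(UUID u) {
        if (u.version() != 1) throw new IllegalArgumentException("not a timeuuid");
        return (u.timestamp() - GREGORIAN_OFFSET) / 10_000;
    }

    public static void main(String[] args) {
        UUID first  = UUID.fromString("17cd4be0-eb1f-11e2-8f75-55116e6864a0");
        UUID second = UUID.fromString("17d86f70-eb1f-11e2-8f75-55116e6864a0");
        System.out.println(new java.util.Date(toUnixMillis(first)));
        // Clustering on post_id sorts by this embedded time, so the first post sorts first.
        System.out.println(toUnixMillis(first) < toUnixMillis(second)); // prints true
    }
}
```

This is why newer posts always land after older ones within a user's partition, with no extra sorting needed at query time.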
  35. As we can see, for each partition key there are composite columns formed using the clustering keys and the non-primary-key columns. The Cassandra storage engine guarantees that columns for a given partition key are stored sequentially, ordered by the clustering keys (post_id in this case). This makes it possible to search rows based on all components of the primary key. Connecting users with their friends What's a social network without friends and connections? We'll store connections between users in a table named Friends. The following command is the DDL definition: CREATE TABLE Friends(user_email_id text,friend_email_id text,friends_since timestamp,subscribed_to_posts boolean,PRIMARY KEY(user_email_id,friend_email_id)); Like Timeline, this table also uses a composite key, comprising the user's email ID and their friend's email ID. Let's insert some records:
  36. INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-02-10 09:12:43+0000',true); INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-02-10 09:12:43+0000',true); INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-01-14 17:28:51+0000',false); INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-01-14 17:28:51+0000',true); INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-03-31 12:37:56+0000',true); INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-03-31 12:37:56+0000',true); INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-05-09 18:19:07+0000',true); INSERT INTO Friends(user_email_id, friend_email_id, friends_since, subscribed_to_posts) VALUES ('','','2013-05-09 18:19:07+0000',true); The output of SELECT * FROM Friends; looks as shown in the following screenshot: Counting friends and posts We need to count and display the number of friends and posts a user has. While it is possible to compute this from the Friends and Posts tables themselves, the Counter Tables provided by
  37. Cassandra are specifically meant for such purposes. They store a numeric count and make it easy to increment/decrement it by an arbitrary number. Counter tables are those that have all non-primary-key columns of data type counter. Primary keys can never have a data type of counter, though they may be composite keys. Counters are always updated (never inserted) with a positive or negative value that causes them to increment or decrement, as the case may be. We'll now create a counter table for storing a user's friend and post counts using the following command: CREATE TABLE UserCounter(user_id text,friends_count counter,posts_count counter,PRIMARY KEY(user_id)); Each time a user adds a friend (a record insertion in the Friends table), you need to increment friends_count by one. And each time a user submits a post (a record insertion in the Posts and Timeline tables), you need to increment posts_count by one. Use the following commands to do so: UPDATE UserCounter SET friends_count = friends_count + 1 WHERE user_id = ''; UPDATE UserCounter SET posts_count = posts_count + 3 WHERE user_id = ''; UPDATE UserCounter SET friends_count = friends_count + 2 WHERE user_id = ''; UPDATE UserCounter SET posts_count = posts_count + 1 WHERE user_id = ''; If a record doesn't exist for the primary key given in a WHERE clause, a row is created and the counter is initialized with a null value. You can check the updated counts on the console: SELECT * FROM UserCounter;
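The essential counter semantics — no inserts, only deltas, with the first update creating the row — can be sketched in plain Java. This is an illustrative simulation only (the class, method names, and user IDs are ours, and this is not how Cassandra implements counters):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of counter-column semantics: counters are never inserted, only updated by a
// delta; the first update creates the entry, just like the first UPDATE creates the row.
public class CounterTableSketch {
    private final Map<String, Long> friendsCount = new ConcurrentHashMap<>();

    // Mirrors: UPDATE UserCounter SET friends_count = friends_count + delta WHERE user_id = ?
    public void addToFriendsCount(String userId, long delta) {
        friendsCount.merge(userId, delta, Long::sum);
    }

    public Long getFriendsCount(String userId) {
        return friendsCount.get(userId); // null when no update has happened yet
    }

    public static void main(String[] args) {
        CounterTableSketch t = new CounterTableSketch();
        t.addToFriendsCount("alice", 1);  // hypothetical user IDs
        t.addToFriendsCount("alice", 2);
        t.addToFriendsCount("bob", -1);   // decrements work the same way
        System.out.println(t.getFriendsCount("alice")); // prints 3
        System.out.println(t.getFriendsCount("bob"));   // prints -1
    }
}
```

Note the asymmetry with regular tables: there is no "set friends_count = 5" here, only relative changes, which is exactly why counter updates are not idempotent (a point that matters for batches, discussed later in this chapter).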
  38. Top 3 features you need to know about So far, we've built data models for a small facebook-like application and learned some important CQL constructs in the process. It's time now to move on to some important CQL features that you will use often. In this chapter, we'll discuss features such as Time To Live (TTL), CQL for client applications, and batch operations. We'll also take a look at the different ways of writing client applications in Java using CQL. TTL (time to live) We've seen that DELETE, and sometimes TRUNCATE, are used to remove rows and columns from a table. In Cassandra, it's also possible to specify an expiration period on the columns of a row when executing INSERT and UPDATE queries. After the specified period expires, the column values are deleted by Cassandra. Syntax The syntax is as follows: INSERT INTO <Table name> (Column1, Column2, ...., Column n) VALUES (Value1, Value2, ......, Value n) USING TTL <ttl in seconds>; UPDATE <Table name> USING TTL <ttl in seconds> SET <Column 1> = <Value 1>, <Column 2> = <Value 2>, ...... <Column n> = <Value n> WHERE <Primary Key column name> = <Primary Key>; After a record is inserted into the database, the remaining TTL of a column can be checked using the following command: SELECT TTL (Column Name) FROM <Table Name> WHERE <Primary Key Column Name> = <Primary Key> ALLOW FILTERING; Gone in 60 seconds! Let's insert a record into the Posts table with a TTL of 60 seconds, as follows: USE facebook; INSERT INTO Posts(post_id, body, device, posted_at) VALUES (dadb6460-fd48-11e2-ab5e-839e8de3c833,'This post goes away in 60 seconds','SMS','2013-08-03 14:01:11') USING TTL 60; SELECT TTL(body) FROM Posts WHERE post_id = dadb6460-fd48-11e2-ab5e-839e8de3c833 ALLOW FILTERING; SELECT TTL(device) FROM Posts WHERE post_id = dadb6460-fd48-11e2-ab5e-839e8de3c833 ALLOW FILTERING; Just after the INSERT operation, TTLs are set to 60 seconds as shown in the following
  39. screenshot: After this insertion, update the expiration of the body column to 5 minutes, and check what happens to this record before the expiration of 60 seconds: UPDATE Posts USING TTL 300 SET body='This post goes away in 5 minutes' WHERE post_id = dadb6460-fd48-11e2-ab5e-839e8de3c833; Run the following queries before and after the expiry of 60 seconds: SELECT TTL(body) FROM Posts WHERE post_id = dadb6460-fd48-11e2-ab5e-839e8de3c833 ALLOW FILTERING; SELECT TTL(device) FROM Posts WHERE post_id = dadb6460-fd48-11e2-ab5e-839e8de3c833 ALLOW FILTERING; SELECT post_id, body, device FROM Posts; Data is still intact before the expiration of 60 seconds:
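The per-column expiry semantics of this experiment can be simulated in plain Java with an injected clock. This is an illustrative sketch only — not how Cassandra implements TTL, and every name in it is ours:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Sketch of TTL semantics: each column value carries its own expiry instant; reading an
// expired column behaves as if the value had been deleted. A later write with a new TTL
// (like UPDATE ... USING TTL 300) replaces the value and resets its expiry.
public class TtlSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> expiresAt = new HashMap<>(); // column -> expiry in ms
    private final LongSupplier clock;                            // injected for testability

    public TtlSketch(LongSupplier clock) { this.clock = clock; }

    public void put(String column, String value, long ttlSeconds) {
        values.put(column, value);
        expiresAt.put(column, clock.getAsLong() + ttlSeconds * 1000);
    }

    public String get(String column) {
        Long e = expiresAt.get(column);
        if (e == null || clock.getAsLong() >= e) return null;    // expired => gone
        return values.get(column);
    }

    public static void main(String[] args) {
        long[] now = {0};
        TtlSketch post = new TtlSketch(() -> now[0]);
        post.put("body", "This post goes away in 60 seconds", 60);
        post.put("device", "SMS", 60);
        now[0] = 30_000;  // 30s later: extend body's expiry to 5 minutes
        post.put("body", "This post goes away in 5 minutes", 300);
        now[0] = 61_000;  // just past the 60s mark
        System.out.println(post.get("device"));         // prints null (expired)
        System.out.println(post.get("body") != null);   // prints true (expiry extended)
    }
}
```

The main method replays exactly the book's scenario: device dies at the 60-second mark, while body survives because its TTL was reset to 300 seconds.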
  40. After the expiration of 60 seconds, the data is lost, as shown in the following screenshot: We can see that the device column got deleted after 60 seconds, but the body column didn't, because we extended its expiry to 5 minutes. CQL for client applications CQL 3 is being pushed as a standard for building applications over Cassandra. Running CQL queries on the shell is fine for viewing and modifying data for testing purposes, but the real use of CQL lies in building applications, just the way SQL is used with JDBC. Next, we are going to explain three ways in which you can build Java applications over Cassandra (it's possible to build applications in other programming languages as well, via appropriate drivers and clients). Using the Thrift API
  41. The Thrift library is based on the Thrift RPC protocol, and high-level clients built over it have been a standard way of building applications for a long time. In this section, we'll explain how to write a client application using CQL as the query language and Thrift as the Java API. When we start Cassandra, by default it listens for Thrift clients (the start_rpc: true property in the CASSANDRA_HOME/conf/cassandra.yaml file enables this). Let's build a small program that connects to Cassandra using the Thrift API and runs CQL 3 queries for reading/writing data in the UserProfiles table we created for the facebook application. The program can be built by performing the following steps: 1. Downloading the Thrift library: You need to add apache-cassandra-thrift-1.2.x.jar (which is found in the CASSANDRA_HOME/lib folder) to your classpath. If your Java project is mavenized, you need to insert the following entry in pom.xml under the dependencies section (the version will vary depending upon your Cassandra server installation): <dependency> <groupId>org.apache.cassandra</groupId> <artifactId>cassandra-thrift</artifactId> <version>1.2.5</version> </dependency> 2. Connecting to the Cassandra server on a given host and port: You need to open an org.apache.thrift.transport.TTransport to the Cassandra node and create an instance of org.apache.cassandra.thrift.Cassandra.Client as follows: TTransport transport = new TFramedTransport(new TSocket("localhost", 9160)); transport.open(); TProtocol protocol = new TBinaryProtocol(transport); Cassandra.Client client = new Cassandra.Client(protocol); client.set_cql_version("3.0.0"); Note The default CQL version for Thrift is 2.0.0. You must set it to 3.0.0 if you are writing CQL 3 queries and don't want to see any version-related errors. After you are done with the transport, close it gracefully (usually at the end of the read/write operations) as follows: transport.close(); 3.
Creating a schema: The executeQuery() utility method accepts a String CQL 3 query and runs it: CqlResult executeQuery(String query) throws Exception { return client.execute_cql3_query(ByteBuffer.wrap(query.getBytes("UTF-8")), Compression.NONE, ConsistencyLevel.ONE); }
  42. Now, create the keyspace and the table by directly executing CQL 3 queries: //Create keyspace executeQuery("CREATE KEYSPACE facebook WITH replication = {'class':'SimpleStrategy','replication_factor':3};"); executeQuery("USE facebook;"); //Create table executeQuery("CREATE TABLE UserProfiles(" + "email_id text," + "password text," + "name text," + "age int," + "profile_picture blob," + "PRIMARY KEY(email_id)" + ");"); 4. Reading/writing data: A couple of records can be inserted as follows: executeQuery("USE facebook;"); executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','p4ssw0rd','John Smith',32,0x8e37);"); executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','guess1t','David Bergin',42,0xc9f1);"); Executing a SELECT query returns a CqlResult, on which we can iterate easily to fetch records: CqlResult result = executeQuery("SELECT * FROM facebook.UserProfiles " + "WHERE email_id = '';"); for (CqlRow row : result.getRows()) { System.out.println(row.getKey()); } Using the Datastax Java driver The Datastax Java driver is based on the Cassandra binary protocol that was introduced in Cassandra 1.2, and it works only with CQL 3. The Cassandra binary protocol is made specifically for Cassandra, in contrast to Thrift, which is a generic framework with many limitations. Now, we are going to write a Java program that uses the Datastax Java driver for reading/writing data into Cassandra, by performing the following steps: Downloading the driver library: This driver library JAR file must be in your classpath in order to build an application using it. If you have a maven-based Java project, you need
  43. to insert the following entry into the pom.xml file under the dependencies section: <dependency> <groupId>com.datastax.cassandra</groupId> <artifactId>cassandra-driver-core</artifactId> <version>1.0.1</version> </dependency> Note This driver project is hosted on GitHub: ( It makes sense to check and download the latest version. Configuring Cassandra to listen for native clients: In newer versions of Cassandra, this is enabled by default and Cassandra will listen for clients using the binary protocol, but earlier Cassandra installations may require enabling it. All you have to do is check and enable the start_native_transport property in the CASSANDRA_HOME/conf/cassandra.yaml file by inserting/uncommenting the following line: start_native_transport: true The port that Cassandra will use for listening to native clients is determined by the native_transport_port property. Note It is possible for Cassandra to listen to both Thrift and native clients simultaneously. If you want to disable Thrift, just set the start_rpc property to false in CASSANDRA_HOME/conf/cassandra.yaml. Connecting to Cassandra: The com.datastax.driver.core.Cluster class is the entry point for clients to connect to the Cassandra cluster: Cluster cluster = Cluster.builder().addContactPoint("").build(); After you are done using it (usually when the application shuts down), close it gracefully: cluster.shutdown(); Creating a session: An object of com.datastax.driver.core.Session allows you to execute CQL 3 statements. The following line creates a Session instance: Session session = cluster.connect();
  44. Creating a schema: Before reading/writing data, let's create a keyspace and a table similar to UserProfiles in the facebook application we built earlier: // Create keyspace session.execute("CREATE KEYSPACE facebook WITH replication = " + "{'class':'SimpleStrategy','replication_factor':1};"); session.execute("USE facebook"); // Create table session.execute("CREATE TABLE UserProfiles(" + "email_id text," + "password text," + "name text," + "age int," + "profile_picture blob," + "PRIMARY KEY(email_id)" + ");"); Reading/writing data: We can insert a couple of records as follows: session.execute("USE facebook"); session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','p4ssw0rd','John Smith',32,0x8e37);"); session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('','guess1t','David Bergin',42,0xc9f1);"); Finding and printing records: A SELECT query returns an instance of com.datastax.driver.core.ResultSet. You can fetch individual rows by iterating over it using the com.datastax.driver.core.Row object: ResultSet results = session.execute("SELECT * FROM facebook.UserProfiles " + "WHERE email_id = '';"); for (Row row : results) { System.out.println("Email: " + row.getString("email_id") + "\tName: " + row.getString("name") + "\tAge: " + row.getInt("age")); } Deleting records: We can delete a record as follows: session.execute("DELETE FROM facebook.UserProfiles WHERE email_id='';"); Using high-level clients In addition to the libraries based on the Thrift and binary protocols, some high-level clients are built with the purpose of easing development and providing additional services, such as connection pooling, load balancing, failover, secondary indexing, and so on. Some of them are listed here: Astyanax ( Astyanax is a high-level Java client for
  45. Cassandra. It allows you to run both simple and prepared CQL queries. Hector ( Hector is a high-level client for Cassandra. At the time of writing this book, it supported CQL 2 only (not CQL 3). Kundera ( Kundera is a JPA 2.0-based object-datastore mapping library for Cassandra and many other NoSQL datastores. CQL 3 queries are run with Kundera using native queries, as described in the JPA specification. Note If you're interested in learning these high-level client APIs, refer to their respective links in the Articles and tutorials section at the end of this book. Batch operations Batch operations in Cassandra allow you to group similar DML statements (INSERT, UPDATE, or DELETE) and run them in one pass. So, how is this different from multiple statements without a batch boundary? From Cassandra 1.2 onward, batch operations are atomic by default. This means either the entire batch succeeds or, in case of an error in any part of the batch, none of the statements succeed. A common timestamp is applied to all CQL statements within a batch; DML statements in a batch can't specify their own timestamp, though they are free to use their own TTLs. Multiple statements are transferred between nodes in one pass, which usually saves time in executing the operation set as a whole. Syntax The syntax is as follows. Multiple DML statements are wrapped between BEGIN BATCH / APPLY BATCH. Constructs written in [] brackets are optional: BEGIN [UNLOGGED | COUNTER] BATCH [USING TIMESTAMP timestamp] DML statement 1; DML statement 2; DML statement 3; ..... DML statement n; APPLY BATCH; Let's perform a simple batch operation on the UserProfiles table:
  46. BEGIN BATCH USING TIMESTAMP 1318452291034 INSERT INTO UserProfiles(email_id, password, name, age) VALUES('','gh5ys9','Patricia Ireland', 35) INSERT INTO UserProfiles(email_id, password, name, age) VALUES('','ha9vb1a','Donald Miller', 45) UPDATE UserProfiles SET password='newp4ssw0rd' WHERE email_id='' DELETE FROM UserProfiles WHERE email_id='' APPLY BATCH; After the end of this batch, one record will be inserted into UserProfiles. Three flavors of the batch operation Batch operations in Cassandra come in three flavors, which are as follows: Atomic batches (using BEGIN BATCH): This is the default mode and guarantees atomicity; that is, batch operations are "either all or none". Atomic batches are achieved in Cassandra by first serializing the batch to a system table named batchlog, as a blob in its data column. The batchlog table resides in the system keyspace, and has the following structure: CREATE TABLE batchlog ( id uuid PRIMARY KEY, data blob, written_at timestamp ); After all the rows within the batch are successfully written, the batchlog data is removed. Non-atomic batches (using BEGIN UNLOGGED BATCH): Because of batchlog overheads, atomic batches are about 30 percent slower than running DML statements without a batch boundary. In cases where atomicity guarantees are not really needed, we can use BEGIN UNLOGGED BATCH, which is non-atomic in nature and doesn't use the batchlog. Counter batches (using BEGIN COUNTER BATCH): Since counter operations are not idempotent (idempotent operations are those that can run multiple times without changing the outcome), replaying multiple counter operations in a batch may give undesirable results. Cassandra provides counter batches for running multiple counter operations in a batch.
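The "either all or none" behavior of atomic batches can be sketched in plain Java: stage every statement in a log, validate them all, and only then touch the table. This is an illustrative simulation of the guarantee, not Cassandra's batchlog implementation; the class name, validation rule, and email values are all ours:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an atomic batch: writes are staged first; if any statement is invalid,
// the target table is left untouched ("either all or none").
public class AtomicBatchSketch {
    private final Map<String, String> table = new HashMap<>(); // email_id -> name
    private final List<String[]> batchlog = new ArrayList<>(); // staged writes

    public void add(String emailId, String name) {
        batchlog.add(new String[]{emailId, name});
    }

    public void applyBatch() {
        // Validate every statement before applying any of them.
        for (String[] row : batchlog) {
            if (row[0] == null || row[0].isEmpty())
                throw new IllegalArgumentException("primary key required");
        }
        for (String[] row : batchlog) table.put(row[0], row[1]);
        batchlog.clear(); // like removing the batchlog entry after a successful batch
    }

    public int size() { return table.size(); }

    public static void main(String[] args) {
        AtomicBatchSketch ok = new AtomicBatchSketch();
        ok.add("a@example.com", "Patricia Ireland"); // hypothetical keys
        ok.add("b@example.com", "Donald Miller");
        ok.applyBatch();
        System.out.println(ok.size()); // prints 2

        AtomicBatchSketch bad = new AtomicBatchSketch();
        bad.add("c@example.com", "X");
        bad.add("", "missing key");
        try { bad.applyBatch(); } catch (IllegalArgumentException e) { /* whole batch rejected */ }
        System.out.println(bad.size()); // prints 0
    }
}
```

The second batch illustrates the atomicity guarantee: one bad statement rejects the whole batch, so even the valid insert never reaches the table.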
  47. People and places you should get to know Cassandra is officially an Apache Software Foundation project, and lots of documentation is available on the Apache website. Top developers and founders work at Datastax ( and their website hosts a rich collection of up-to-date documentation. Official sites The Cassandra home page, an official website that provides download and documentation links, is found at The CQL 3.0 manual from the Datastax website is at The Cassandra Wiki (which takes you to useful links): Blogs by top developers and thought leaders of Cassandra are found at Read, listen, watch, meet, and learn on your lonely Cassandra planet: The source code is available here, and can be cloned using git: Client options for CQL are found at Articles and tutorials Datastax Java Driver 1.0 for Apache Cassandra documentation: The Cassandra binary protocol blog: Running CQL queries with Astyanax: Running CQL queries with Hector: Running CQL queries with Kundera: Community Official mailing list: Official IRC channel: #cassandra User FAQ: Blogs Datastax Developer Blog: Jonathan Ellis's technical blog: Twitter
  48. Follow Jonathan Ellis (co-founder of Cassandra) on Twitter: Follow Sylvain Lebresne (Cassandra committer) on Twitter: Follow Planet Cassandra on Follow DataStax on For more open source information, follow Packt at!/packtopensource