DataStax NYC Java Meetup: Cassandra with Java


DataStax presentation at NYC Java Meetup on June 30, 2014 on using Apache Cassandra via Java

  • DSE
    C* Architecture
    Integration with Java
  • Collections / playlists: Spotify
    Fraud detection: financial services
    Recommendations / personalization: eBay
    IoT / smart devices: Nest
    Messaging: Instagram
  • Massively scalable NoSQL database. Netflix example: 10 million/sec; 1 trillion/day; 3000 nodes

    Q: kilo, mega, giga, tera, peta, exa, zetta, yotta
  • Always on:
    Peer-to-peer architecture: all nodes are equal; each node is responsible for an assigned range (or ranges) of data
    Clients can write (or read) data to any node in the ring; native drivers can round-robin across a DC and distribute load to a coordinator node
    The coordinator node writes (or reads) copies of the data to the nodes that own each copy
    In the case of a failure (such as a drive going down), 2 out of the 3 replica nodes are still up, so writes and reads still succeed for the majority of replicas, and therefore C* is always on
  • Multi-DC is very, very easy to configure with Cassandra
    Datacenters are active-active: write to either DC and the other one will get a copy
    In the case of a datacenter outage, applications can use a retry policy that fails over to the other datacenter, which also has a copy of the data

    Outbrain story – Hurricane Sandy


    What is quorum? e.g. 10 nodes, RF = 6: quorum = 4, so reads and writes at QUORUM still succeed with up to 2 replicas down
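The quorum arithmetic behind that example can be sketched in a few lines of Java; this is just the math from the note above (quorum is a majority of replicas), not driver code:

```java
public class QuorumMath {
    // Quorum is a majority of the replicas: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // At QUORUM, RF - quorum replicas can be down and requests still succeed.
    static int tolerableFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    public static void main(String[] args) {
        // The note's example: RF = 6 gives quorum 4, tolerating 2 down replicas.
        System.out.println(quorum(6) + " " + tolerableFailures(6)); // 4 2
        // The common RF = 3 case: quorum 2, tolerating 1 down replica.
        System.out.println(quorum(3) + " " + tolerableFailures(3)); // 2 1
    }
}
```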
  • Similar to SQL, but does not have all SQL options
    DevCenter / cqlsh
  • keyspace = database instance
    SimpleStrategy vs. NetworkTopologyStrategy
    replication factor
  • data types
    primary key / partition key
  • set: stored sorted (e.g. alphabetically). Example: preferences, email subscriptions
    list: keeps a specific order. Example: search engines: google, yahoo, bing
    map: phone type & phone number
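The behavior of the three collection types can be mirrored with plain java.util collections; this is only an illustration of the semantics the notes describe (sorted set, ordered list, keyed map), not driver code, and the sample values are invented:

```java
import java.util.*;

public class CollectionSemantics {
    public static void main(String[] args) {
        // set<text>: values are deduplicated and stored sorted,
        // like the email-subscription preferences example.
        SortedSet<String> subscriptions = new TreeSet<>(
            Arrays.asList("sports", "news", "sports", "deals"));
        System.out.println(subscriptions); // [deals, news, sports]

        // list<text>: values keep insertion order and allow duplicates,
        // like the ranked search-engine example.
        List<String> engines = Arrays.asList("google", "yahoo", "bing");
        System.out.println(engines); // [google, yahoo, bing]

        // map<int, text>: key-value pairs, like phone type -> phone number.
        Map<Integer, String> phones = new LinkedHashMap<>();
        phones.put(1, "home");
        phones.put(2, "mobile");
        System.out.println(phones); // {1=home, 2=mobile}
    }
}
```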
  • user defined types
    batch -> all or nothing - atomic
  • Asynchronous: Netty is an asynchronous, event-driven network application framework. It does not block threads, so it can scale.
    It greatly simplifies and streamlines network programming such as TCP and UDP socket servers, and supports request pipelining
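Netty itself is beyond a short sketch, but the non-blocking style this note describes (submit work, attach a callback, keep the calling thread free) can be illustrated with the JDK's CompletableFuture; the "row-data" payload is a made-up stand-in for a network round trip:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncSketch {
    // Kick off the "query" on another thread and attach a transformation;
    // the calling thread is never blocked while the work runs.
    static CompletableFuture<String> fetchAndProcess() {
        return CompletableFuture
            .supplyAsync(() -> "row-data")            // stand-in for a network round trip
            .thenApply(rows -> "processed:" + rows);  // callback runs when the result arrives
    }

    public static void main(String[] args) {
        System.out.println(fetchAndProcess().join()); // processed:row-data
    }
}
```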
    Other drivers: C#, Python, ODBC + community drivers
    high-frequency / multi-threaded
    get push notifications back from C*, e.g. topology changes, node status changes, schema changes
  • v2.0.2
  • contact points = used to discover the cluster; the driver does not connect only to these nodes
    Retrieves metadata from the cluster, including:
    the name of the cluster; the IP address, rack, and data center of each node
    Session instances are thread-safe, and normally a single session instance is all you need per application per keyspace
    schemaless -> no create date
  • executeAsync => returns a future implementing Guava's ListenableFuture => at large scale, execute long queries with callbacks
    Execute async to get a future on your result set
    - which means from a single thread you can push a bunch of requests to be executed in parallel
    - look at futures in Java
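The single-thread, many-futures pattern in these notes can be sketched with a plain ExecutorService standing in for executeAsync; the squared-number "queries" are invented purely so the example has something to compute:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFutures {
    // Fire many small "queries" from one thread, then walk the futures.
    static int runQueries(int queryCount) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < queryCount; i++) {
            final int id = i;
            futures.add(pool.submit(() -> id * id)); // stand-in for session.executeAsync(...)
        }
        int total = 0;
        for (Future<Integer> f : futures) {
            try {
                total += f.get(); // iterate the futures and gather results
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) {
        System.out.println(runQueries(10)); // 285
    }
}
```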
  • example: read with a callback
    break work into small queries and execute them concurrently -> send as much as C* can take
  • fire a ton of selects, get futures back, iterate, and process the results
    breaking work into small queries -> better latency, and a small query is easier to retry
  • prepare, compile, execute
    for CQL statements executed multiple times
    for performance: precompiled & cached on the server; C* parses once & executes many times
    use ? as a placeholder for literal values
    binding values produces a BoundStatement object
    the BoundStatement gets executed by the Session object
  • programmatically build the query
    quite easy when you need to generate queries dynamically
    or if you're just more familiar with it

    Note: the default consistency level is ConsistencyLevel.ONE
  • RoundRobin: not DC aware; each request goes to the next node in the cluster
    DCAwareRoundRobin: round-robins within the local DC
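The plain round-robin policy described here reduces to cycling an index over the known nodes; a minimal sketch of that idea (node names invented, not the driver's actual RoundRobinPolicy class):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobin {
    private final List<String> nodes;
    private final AtomicInteger index = new AtomicInteger();

    RoundRobin(List<String> nodes) { this.nodes = nodes; }

    // Each request goes to the next node in the list, wrapping around.
    String nextCoordinator() {
        int i = Math.floorMod(index.getAndIncrement(), nodes.size());
        return nodes.get(i);
    }

    public static void main(String[] args) {
        RoundRobin policy = new RoundRobin(Arrays.asList("node1", "node2", "node3"));
        for (int i = 0; i < 4; i++) {
            System.out.println(policy.nextCoordinator());
        }
        // prints node1, node2, node3, then wraps to node1
    }
}
```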
  • TokenAware: the client connects to the node where the data resides, a performance advantage
    HQ: SF
    offices in Austin, London, Paris
    delivering Apache C* to the enterprise
  • DataStax is the company that delivers Cassandra to the enterprise. First, we take the open source software and put it through rigorous quality assurance tests, including a 1000-node scalability test. We certify it and provide the world's most comprehensive support, training, and consulting for Cassandra so that you can get up and running quickly. But that isn't all DataStax does. We also build additional software features on top of Cassandra, including security, search, and analytics, as well as in-memory capabilities that don't come with the open source Cassandra product. We also provide management services to help visualize your nodes, plan your capacity, and repair issues automatically. Finally, we provide developer tools and drivers as well as monitoring tools. DataStax is the commercial company behind Apache Cassandra, plus a whole host of additional software and services.
  • Spark integration: 100x faster analytics; integrated real-time analytics
    performance tuning: supplies diagnostic info on how the cluster is doing
  • DataStax NYC Java Meetup: Cassandra with Java

    1. Confidential. Apache Cassandra & Java. Caroline George, DataStax Solutions Engineer. June 30, 2014
    2. Agenda • Cassandra Overview • Cassandra Architecture • Cassandra Query Language • Interacting with Cassandra using Java • About DataStax
    4. Who is using DataStax? Collections / Playlists Recommendation / Personalization Fraud detection Messaging Internet of Things / Sensor data
    5. What is Apache Cassandra? Apache Cassandra™ is a massively scalable NoSQL database. • Continuous availability • High performing writes and reads • Linear scalability • Multi-data center support
    6. The NoSQL Performance Leader. Netflix Cloud Benchmark: "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput." (Source: Netflix Tech Blog) End Point Independent NoSQL Benchmark: highest in throughput, lowest in latency. (Source: Solving Big Data Challenges for Enterprise Application Performance Management, benchmark paper presented at the Very Large Database Conference, 2013)
    7. Cassandra is Fault Tolerant. [Ring diagram: tokens 10, 20, 30, 40, 50, 60, 70, 80; Replication Factor = 3] A node fails or goes down temporarily; we could still retrieve the data from the other 2 nodes. Sample data: Token 70, Order_id 1001, Qty 10, Sale 100; Token 44, Order_id 1002, Qty 5, Sale 50; Token 15, Order_id 1003, Qty 30, Sale 200
    8. Multi Data Center Support. [Two ring diagrams: East Data Center and West Data Center] A data center outage occurs; no interruption to the business
    9. Writes in Cassandra. Data is organized into partitions. 1. Data is written to a Commit Log for a node (durability) 2. Data is written to a MemTable (in memory) 3. MemTables are flushed to disk in an SSTable based on size. SSTables are immutable
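The write path on this slide (commit log first, then memtable, then a flush to an immutable SSTable once the memtable is big enough) can be caricatured in a few lines of Java. This is an illustration of the flow only, with an invented flush threshold and string-keyed rows, not how Cassandra is actually implemented:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();           // 1. durability
    private final TreeMap<String, String> memtable = new TreeMap<>();   // 2. in memory
    private final List<SortedMap<String, String>> sstables = new ArrayList<>(); // 3. "on disk"
    private final int flushThreshold;

    WritePathSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // append to the commit log first
        memtable.put(key, value);         // then update the in-memory memtable
        if (memtable.size() >= flushThreshold) {
            // Flush: the memtable's contents become an immutable "SSTable".
            sstables.add(Collections.unmodifiableSortedMap(new TreeMap<>(memtable)));
            memtable.clear();
        }
    }

    int sstableCount() { return sstables.size(); }
    int memtableSize() { return memtable.size(); }

    public static void main(String[] args) {
        WritePathSketch node = new WritePathSketch(2);
        node.write("1001", "qty=10");
        node.write("1002", "qty=5");  // second write triggers a flush
        node.write("1003", "qty=30"); // this one sits in the memtable
        System.out.println(node.sstableCount() + " " + node.memtableSize()); // 1 1
    }
}
```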
    10. Tunable Data Consistency
    11. Built for Modern Online Applications • Architected for today's needs • Linear scalability at lowest cost • 100% uptime • Operationally simple
    12. Cassandra Query Language
    13. CQL - DevCenter. A SQL-like query language for communicating with Cassandra. DataStax DevCenter: a free, visual query tool for creating and running CQL statements against Cassandra and DataStax Enterprise.
    14. CQL - Create Keyspace: CREATE KEYSPACE demo WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'EastCoast': 3, 'WestCoast': 2};
    15. CQL - Basics: CREATE TABLE users ( username text, password text, create_date timestamp, PRIMARY KEY (username, create_date) ) WITH CLUSTERING ORDER BY (create_date DESC); INSERT INTO users (username, password, create_date) VALUES ('caroline', 'password1234', '2014-06-01 07:01:00'); SELECT * FROM users WHERE username = 'caroline' AND create_date = '2014-06-01 07:01:00'; Predicates: on the partition key, = and IN; on the clustering columns, <, <=, =, >=, >, IN
    16. Collection Data Types. CQL supports columns that contain collections of data. The collection types are Set, List, and Map. Favor sets over lists for better performance. CREATE TABLE users ( username text, set_example set<text>, list_example list<text>, map_example map<int,text>, PRIMARY KEY (username) );
    17. Plus much more… Lightweight Transactions: INSERT INTO customer_account (customerID, customer_email) VALUES ('LauraS', '') IF NOT EXISTS; UPDATE customer_account SET customer_email='' IF customer_email=''; Counters: UPDATE UserActions SET total = total + 2 WHERE user = 123 AND action = 'xyz'; Time to live (TTL): INSERT INTO users (id, first, last) VALUES ('abc123', 'abe', 'lincoln') USING TTL 3600; Batch Statements: BEGIN BATCH INSERT INTO users (userID, password, name) VALUES ('user2', 'ch@ngem3b', 'second user') UPDATE users SET password = 'ps22dhds' WHERE userID = 'user2' INSERT INTO users (userID, password) VALUES ('user3', 'ch@ngem3c') DELETE name FROM users WHERE userID = 'user2' APPLY BATCH;
    18. JAVA CODE EXAMPLES
    19. DataStax Java Driver • Written for CQL 3.0 • Uses the binary protocol introduced in Cassandra 1.2 • Uses Netty to provide an asynchronous architecture • Can do asynchronous or synchronous queries • Has connection pooling • Has node discovery and load balancing
    20. Add .JAR Files to Project. The easiest way is to do this with Maven, a software project management tool
    21. Add .JAR Files to Project. In the pom.xml file, select the Dependencies tab. Click the Add… button in the left column. Enter the DataStax Java driver info
    22. Connect & Write: Cluster cluster = Cluster.builder() .addContactPoints("", "") .build(); Session session = cluster.connect("demo"); session.execute( "INSERT INTO users (username, password) " + "VALUES ('caroline', 'password1234')" ); Note: Cluster and Session objects should be long-lived and re-used
    23. Read from Table: ResultSet rs = session.execute("SELECT * FROM users"); List<Row> rows = rs.all(); for (Row row : rows) { String userName = row.getString("username"); String password = row.getString("password"); }
    24. Asynchronous Read: ResultSetFuture future = session.executeAsync( "SELECT * FROM users"); for (Row row : future.get()) { String userName = row.getString("username"); String password = row.getString("password"); } Note: The future returned implements Guava's ListenableFuture interface. This means you can use all of Guava's Futures methods!
    25. Read with Callbacks: final ResultSetFuture future = session.executeAsync("SELECT * FROM users"); future.addListener(new Runnable() { public void run() { for (Row row : future.getUninterruptibly()) { String userName = row.getString("username"); String password = row.getString("password"); } } }, executor);
    26. Parallelize Calls: int queryCount = 99; List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>(); for (int i=0; i<queryCount; i++) { futures.add( session.executeAsync("SELECT * FROM users " +"WHERE username = '"+i+"'")); } for(ResultSetFuture future : futures) { for (Row row : future.getUninterruptibly()) { //do something } }
    27. Prepared Statements: PreparedStatement statement = session.prepare( "INSERT INTO users (username, password) " + "VALUES (?, ?)"); BoundStatement bs = statement.bind(); bs.setString("username", "caroline"); bs.setString("password", "password1234"); session.execute(bs);
    28. Query Builder: Query query = QueryBuilder .select() .all() .from("demo", "users") .where(eq("username", "caroline")); ResultSet rs = session.execute(query);
    29. Load Balancing. Determines which node will be contacted next once a connection to a cluster has been established. Cluster cluster = Cluster.builder() .addContactPoints("","") .withLoadBalancingPolicy( new DCAwareRoundRobinPolicy("DC1")) .build(); Policies are: • RoundRobinPolicy • DCAwareRoundRobinPolicy (default) • TokenAwarePolicy
    30. RoundRobinPolicy • Not data-center aware • Each subsequent request after the initial connection to the cluster goes to the next node in the cluster • If the node serving as coordinator fails during a request, the next node is used
    31. DCAwareRoundRobinPolicy • Is data center aware • Does a round robin within the local data center • Only goes to another data center if there is no node available to be coordinator in the local data center
    32. TokenAwarePolicy • Is aware of where the replicas for a given token live • Instead of round robin, the client chooses the node that contains the primary replica to be the coordinator • Avoids the unnecessary hop of contacting an arbitrary coordinator that must then contact the nodes with the replicas
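The token-aware routing this slide describes can be sketched as a lookup on a sorted token ring: route each token to the first node at or after it, wrapping around at the end. The tokens and node names here are invented for illustration; the real driver does this with the cluster's token metadata rather than a hand-built map:

```java
import java.util.Map;
import java.util.TreeMap;

public class TokenRing {
    // Each node owns the range of tokens up to and including its own token.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int token, String name) { ring.put(token, name); }

    // Route to the first node whose token is >= the key's token,
    // wrapping to the lowest token when we fall off the end of the ring.
    String ownerOf(int token) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(token);
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(10, "node-a");
        ring.addNode(40, "node-b");
        ring.addNode(70, "node-c");
        System.out.println(ring.ownerOf(15)); // node-b
        System.out.println(ring.ownerOf(75)); // wraps around: node-a
    }
}
```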
    33. Additional Information & Support • Community Site ( • Documentation ( • Downloads ( • Getting Started ( • DataStax (
    34. ABOUT DATASTAX
    35. About DataStax. Founded in April 2010. 30 Percent. 500+ Customers. Santa Clara, Austin, New York, London. 300+ Employees
    36. Confidential. DataStax delivers Apache Cassandra to the Enterprise. Certified / Enterprise-ready Cassandra. Visual Management & Monitoring Tools. 24x7 Support & Training
    38. DSE 4.5
    39. Thank You! @carolinerg Follow for more updates all the time: @PatrickMcFadin