Getting Started with DataStax Enterprise from a Technical Perspective

The requirements for building today’s online applications have changed. Implementing legacy technology hinders your ability to innovate, ensure application performance, and meet the demands of your customers. So how do you determine what underlying systems are the right fit for your needs?

Join us as we review the following to help you get started with DataStax Enterprise:

- What is Cassandra and why should you care?
- What is DataStax Enterprise and how does it differ from Cassandra?
- What are the steps to evaluating DataStax Enterprise?
- Valuable resources to get up to speed on Cassandra and DataStax Enterprise

Published in: Software, Technology

  • Today, we are going to go over the technical basics of Cassandra and DataStax Enterprise and then discuss the typical evaluation process.
  • Count of current companies/groups: over 1000 using Cassandra, over 500 using DataStax
  • This presentation will focus on three practical topics for getting started with DataStax Enterprise: (1) why C*, (2) why DSE, and (3) how clients typically run an evaluation, with recommended resources along the way.
  • Massively scalable NoSQL database/Netflix example: 10 million/sec; 1 trillion/day; 3000 nodes
  • Always on:
    Peer-to-peer architecture – all nodes are equal; each node is responsible for an assigned range (or ranges) of data
    Clients can write (or read) data to any node in the ring – native drivers can round-robin across a DC and distribute load to a coordinator node
    The coordinator node writes (or reads) copies of data to (or from) the nodes which own each copy
    In the case of a failure (such as a drive going down) with three copies of the data,
    2 out of the 3 replica nodes are still up, so a majority of replicas can still serve writes and reads and therefore C* is always on
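
The replica placement and "2 out of 3" reasoning above can be sketched in a few lines of Java. This is a toy model under stated assumptions, not Cassandra's implementation: the class name `RingSketch` is hypothetical, the eight tokens (10 to 80) and the row token 44 are taken from the fault-tolerance slide, and replicas are simply placed on the owning node and the next nodes clockwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a token ring (hypothetical class, not part of any driver). */
class RingSketch {
    private final TreeSet<Integer> nodeTokens = new TreeSet<>();
    private final int replicationFactor;

    RingSketch(int replicationFactor, int... tokens) {
        this.replicationFactor = replicationFactor;
        for (int t : tokens) nodeTokens.add(t);
    }

    /** First node whose token is >= the row token, then the next nodes clockwise. */
    List<Integer> replicasFor(int rowToken) {
        List<Integer> replicas = new ArrayList<>();
        Integer current = nodeTokens.ceiling(rowToken);
        if (current == null) current = nodeTokens.first(); // wrap around the ring
        while (replicas.size() < Math.min(replicationFactor, nodeTokens.size())) {
            replicas.add(current);
            current = nodeTokens.higher(current);
            if (current == null) current = nodeTokens.first();
        }
        return replicas;
    }

    /** A quorum is a majority of the replicas: floor(RF / 2) + 1. */
    int quorum() {
        return replicationFactor / 2 + 1;
    }

    /** True if enough replicas of the row are still up to form a quorum. */
    boolean quorumAvailable(int rowToken, int... downNodes) {
        List<Integer> alive = new ArrayList<>(replicasFor(rowToken));
        for (int down : downNodes) alive.remove(Integer.valueOf(down));
        return alive.size() >= quorum();
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch(3, 10, 20, 30, 40, 50, 60, 70, 80);
        System.out.println(ring.replicasFor(44));             // [50, 60, 70]
        System.out.println(ring.quorumAvailable(44, 50));     // true: 2 of 3 replicas up
        System.out.println(ring.quorumAvailable(44, 50, 60)); // false: only 1 replica up
    }
}
```

With a replication factor of 3, losing one replica still leaves a majority, which is the situation the note describes.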
  • Independent benchmarks prove out linear scalability – Netflix and University of Toronto; at any number of nodes, this is what we are seeing for a read/write mix

    Source: Solving Big Data Challenges for Enterprise Application Performance Management, Tilmann Rabl, et al., August 2013, p. 10. Benchmark paper presented at the Very Large Database Conference, 2013. http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2013.pdf

    Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

  • Need to speed up your reads and writes? It is very simple to add nodes. Throughput improves linearly as nodes are added, as a result of the peer-to-peer architecture sharing the data across nodes. Netflix – 3000 nodes – brings 500 nodes up or down to manage anticipated spikes in load
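
As a rough illustration of what linear scaling means for capacity planning, the sketch below projects throughput from a measured baseline. It assumes perfectly linear scaling, which real clusters only approximate; the class name `ThroughputProjection` and the baseline of 4 nodes at 200,000 operations per second are hypothetical values chosen for illustration.

```java
/** Linear capacity projection (hypothetical helper, assumes ideal linear scaling). */
class ThroughputProjection {

    /** Projected operations per second for targetNodes, scaled from a measured baseline. */
    static long projectedOpsPerSec(long measuredOpsPerSec, int measuredNodes, int targetNodes) {
        return measuredOpsPerSec * targetNodes / measuredNodes;
    }

    /** Smallest node count expected to reach requiredOpsPerSec under the same assumption. */
    static int nodesNeeded(long measuredOpsPerSec, int measuredNodes, long requiredOpsPerSec) {
        double perNode = (double) measuredOpsPerSec / measuredNodes;
        return (int) Math.ceil(requiredOpsPerSec / perNode);
    }

    public static void main(String[] args) {
        // Hypothetical baseline: 4 nodes measured at 200,000 ops/sec.
        System.out.println(projectedOpsPerSec(200_000, 4, 8));  // 400000
        System.out.println(nodesNeeded(200_000, 4, 1_000_000)); // 20
    }
}
```

A projection like this is only a starting point; it is worth confirming with a stress test at the target node count.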
  • Multi-DC is very easy to configure with Cassandra
    Datacenters are active-active: write to either DC and the other one will get a copy
    In the case of a datacenter outage, applications can apply a retry policy which flips over to the other datacenter, which also has a copy of the data

    Outbrain story – Hurricane Sandy
  • Choice for today’s modern online applications – architects know that these types of applications must always stay on and therefore need to scale easily to handle load
  • We’ve covered the benefits of using Cassandra: (1) high availability, (2) linear scalability, and (3) ease of multi-DC configuration
  • Now, we’ll cover the value of DSE – what does DataStax Enterprise bring to the table?
  • DataStax is the company that delivers Cassandra to the enterprise. First, we take the open source software and put it through rigorous quality assurance tests, including a 1000-node scalability test. We certify it and provide the world’s most comprehensive support, training and consulting for Cassandra so that you can get up and running quickly. But that isn’t all DataStax does. We also build additional software features on top of Cassandra, including security, search, analytics and in-memory capabilities that don’t come with the open source Cassandra product. We also provide management services to help visualize your nodes, plan your capacity and repair issues automatically. Finally, we provide developer tools and drivers as well as monitoring tools. DataStax is the commercial company behind Apache Cassandra plus a whole host of additional software and services.
  • Side-by-side comparison of what C* open source offers compared to DSE; note the tested and certified version of the binaries for production plus product features and support
  • Visual, browser-based user interface negates need to install client software
    Administration tasks carried out in point-and-click fashion
    Allows for visual rebalance of data across a cluster when new nodes are added
    Contains proactive alerts that warn of impending issues.
    Built-in external notification abilities
    Visually perform and schedule backup operations
  • CQL as served up using DevCenter – works with the community edition too; worth mentioning given the ease of working with CQL and its similarities with SQL
  • Internal Authentication Manages login IDs and passwords inside the database

    Ensures only authorized users can access a database system using internal validation
    Simple to implement and easy to understand
    No learning curve from the relational world

    Object Permission Management
    controls who has access to what and who can do what in the database

    Provides granular based control over who can add/change/delete/read data
    Uses familiar GRANT/REVOKE from relational systems
    No learning curve

    Client to Node Encryption protects data in flight to and from a database cluster

    Ensures data cannot be captured/stolen en route to a server
    Data is safe both in flight from/to a database and on the database; complete coverage is ensured
  • External Authentication uses external security software packages to control security

    Only authorized users have access to a database system using external validation
    Uses most trusted external security packages (Kerberos, LDAP), mainstays in government and finance
    Single sign on to all data domains

    Transparent Data Encryption encrypts data at rest


    Protects sensitive data at rest from theft and from being read at the file system level
    No changes needed at application level
    Can encrypt both Cassandra and Hadoop data

    Data Auditing provides a trail of who did or looked at what, and when


    Supplies admins with an audit trail of all accesses and changes
    Granular control to audit only what’s needed
    Uses log4j interface to ensure performance and efficient audit operations

  • Built-in enterprise search on Cassandra data via Solr integration
    Very fast performance
    Search indexes can span multiple data centers (regular Solr cannot)
    Online scalability via adding new nodes
    Built-in failover; continuously available
  • Same concepts apply for Hadoop in analytics nodes as compared with SOLR nodes: a great way to run reporting on your data in your database without having to worry about porting over to a separate Hadoop environment – not a substitute for Hadoop, but perfect for a great deal of use cases
  • Here is a diagram of the typical process which clients run through when trying out DataStax. Often, a developer or DBA downloads and installs the sandbox on a local laptop in a Linux environment, such as a VM, or on a dev box, just to try it out. Along the way, use cases are evaluated for fit and data models are designed. At a certain point, there will be a desire to test how Cassandra and DSE as a whole work within a multi-node cluster. Sample data is loaded using a given data model and then benchmarks are performed – how hard can you hit the typical 4 nodes with 3 copies of data until the write/read load breaks the box? The Cassandra stress tool or the drivers are used to create the read/write mix. Based on the behavior for 4 nodes, for example, load can be linearly projected (or tested, for that matter) for more nodes.




    Pertinent links are provided below:
    Sandbox download – http://www.datastax.com/download#dl-sandbox
    Binaries download – http://www.datastax.com/download#dl-enterprise
    Typical use cases on Planet Cassandra – http://planetcassandra.org/functional-use-cases/ (by function) and http://planetcassandra.org/industry-use-cases/ (by industry)
    SOLR Tutorial and Overview - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchTOC.html
    Hadoop Overview - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaTOC.html
    Data Modeling – http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddlCQLDataModelingTOC.html
    http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html http://www.datastax.com/documentation/cql/3.1/cql/cql_using/about_cql_c.html

    Copy command – http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/copy_r.html
    Java driver - http://www.datastax.com/documentation/developer/java-driver/2.0/common/drivers/introduction/introArchOverview_c.html
    Cassandra Stress Tool - http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCStress_t.html
  • Here are some of the recommended settings for your PoC environment. Again, we highly recommend starting with at least 3 copies of data across 4 nodes. SSDs are by far the preferred drive: you will save on the number of servers needed and on electricity, and the response time of these drives is on the order of 100 times faster for reads and writes. The latest 3.x Linux kernels optimize buffered caching, which helps performance, since the buffer cache is another way of caching data – the more RAM, the better, especially for caching. RAM should be at least 16 GB per box. We have no preference as to which cloud environment is used. There are Amazon AMIs already set up to jump-start folks on DSE – they can be found by searching for DataStax in the EC2 marketplace. VM images on hosted boxes work fine but you will lose around 10% efficiency due to resource sharing; if going with VMs, please be certain to use directly attached physical drives for each image. SAN is highly discouraged.

    Hardware Recommendations - http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.html

    Standard Install Instructions - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html


    EC2 Install with template DSE AMI’s - http://vimeo.com/89539972


    EC2 Planning Out a Cluster - http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningEC2_c.html
     
    Reference Architecture - http://www.datastax.com/wp-content/uploads/2014/01/WP-DataStax-Enterprise-Reference-Architecture.pdf


    See Appendix for EC2 Install with Linux AMI’s (Slide #27)
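
To make the "3 copies of data across 4 nodes" recommendation concrete, here is a small sizing sketch. It assumes tokens are evenly balanced, so each node holds roughly RF / N of the dataset, and it ignores compaction overhead and compression. The class name `PocSizing` and the 200 GB dataset size are hypothetical.

```java
/** Back-of-the-envelope PoC sizing (hypothetical helper, assumes evenly balanced tokens). */
class PocSizing {

    /** Average share of the full dataset stored on each node. */
    static double fractionPerNode(int replicationFactor, int nodes) {
        return Math.min(1.0, (double) replicationFactor / nodes);
    }

    /** Approximate raw data per node in GB, before compaction overhead or compression. */
    static double dataPerNodeGb(double datasetGb, int replicationFactor, int nodes) {
        return datasetGb * fractionPerNode(replicationFactor, nodes);
    }

    public static void main(String[] args) {
        // RF = 3 on 4 nodes: each node stores about three quarters of the data.
        System.out.println(fractionPerNode(3, 4));      // 0.75
        System.out.println(dataPerNodeGb(200.0, 3, 4)); // 150.0
    }
}
```

The same arithmetic shows why adding nodes lowers the per-node footprint while the replication factor stays fixed.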
  • There are lots of free resources at people’s disposal for both education and evaluation. Most of the items listed on the left of this slide are reachable through the datastax.com website. In this discussion, we are focusing more on the items on the left-hand side; however, there are places where paid-for items make a lot of sense. For example, public training events are listed on datastax.com and can be registered for there. Some clients opt for in-person specialized training for a day or two with an architect. Your account rep can walk you through the options in terms of each of the three engagement models we provide, tailored to meet your needs.
    There are also helpful starter packages which you can discuss with the account managers.
  • With respect to assistance, there are three categories of people support which DataStax provides: training, consulting, and support. For example, learning how SOLR and Hadoop nodes work is covered in training. Specific questions, best practices, or performance tuning would be more along the lines of consulting. Support addresses bugs for clients.
  • Here are some links that we’ve had to provide to lots of clients along the way and felt were worth sharing, starting with Patrick McFadin’s four recorded videos on data modeling.
    Patrick McFadin’s Data Modeling Series - http://wiki.apache.org/cassandra/DataModel
    Advance Time Series Best Practices - http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
    CQL/Data Modeling on DataStax - http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html http://www.datastax.com/documentation/cql/3.1/cql/cql_using/about_cql_c.html
    Virtual Training - http://www.datastax.com/what-we-offer/products-services/training/virtual-training#tab
    Public Training Signup - http://www.datastax.com/what-we-offer/products-services/training
    Sample Projects (Java driver code, etc) - https://github.com/DataStaxCodeSamples/
    SOLR Documentation and Tutorial on DataStax - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchTOC.html
    Analytics documentation - http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaTOC.html
    Github code samples - https://github.com/DataStaxCodeSamples?query=+only%3Apublic+

  • There are lots of readily available resources, as you can see, so hopefully this will make your evaluation process as efficient as possible.
  • * http://research.google.com/archive/bigtable.html
  • San Francisco has RF = 3; Boston has RF = 2
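
Per-data-center replication factors like these are set on the keyspace. The sketch below only builds the CQL text for a `NetworkTopologyStrategy` keyspace; it does not talk to a cluster. The class name `KeyspaceCql`, the keyspace name `demo`, and the data center names `SanFrancisco` and `Boston` are assumptions for illustration; in a real cluster the names must match what the snitch reports.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds a CREATE KEYSPACE statement with one replication factor per data center. */
class KeyspaceCql {

    static String createKeyspace(String keyspace, Map<String, Integer> rfByDataCenter) {
        StringBuilder cql = new StringBuilder("CREATE KEYSPACE ")
                .append(keyspace)
                .append(" WITH replication = {'class': 'NetworkTopologyStrategy'");
        for (Map.Entry<String, Integer> entry : rfByDataCenter.entrySet()) {
            cql.append(", '").append(entry.getKey()).append("': ").append(entry.getValue());
        }
        return cql.append("};").toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> rf = new LinkedHashMap<>(); // keeps insertion order
        rf.put("SanFrancisco", 3); // hypothetical data center name
        rf.put("Boston", 2);       // hypothetical data center name
        System.out.println(createKeyspace("demo", rf));
    }
}
```

The resulting statement can be run from DevCenter or cqlsh, or passed to `session.execute(...)` in the Java driver.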
  • Learning Objective: Describe how to read data

    This slide demonstrates how to check for the “row not found” condition; it is best practice to check.

    It also demonstrates the use of the one() method where just one row (or possibly none) is expected.
  • Learning Objective: Describe what prepared statements are and when to use them

    This is an example of using prepared statements.
    Prepared statements can be used for inserts or queries, typically in a loop (not shown).
    Focus on the exceptions here as well: you don’t need to catch all of these, but the strings point out the type of error each one signals.

    White space is conserved where possible here.
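  • // Assumes session is an already-connected com.datastax.driver.core.Session and that
    // com.datastax.driver.core.* and com.datastax.driver.core.exceptions.* are imported.
    PreparedStatement statement = session.prepare(
        "INSERT INTO user (username, password) " +
        "VALUES (?, ?);");

    BoundStatement boundStatement = new BoundStatement(statement);

    try {
        session.execute(boundStatement.bind("user4", "user4password"));
    } catch (NoHostAvailableException ex) {
        // No node in the cluster could be reached
        System.out.println("Host Not Available");
    } catch (QueryExecutionException ex) {
        // Valid query that could not be executed (timeout, not enough replicas)
        System.out.println("Requested consistency level not met");
    } catch (QueryValidationException ex) {
        // Query rejected before execution
        System.out.println("Syntax error, invalid query, or not authorized");
    }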

  • Getting Started with DataStax Enterprise from a Technical Perspective

    1. 1. Getting Started with DataStax Enterprise A Technical Overview Confidential
    2. 2. Agenda Why Cassandra? Why DataStax Enterprise? How to Evaluate?
    3. 3. Why Cassandra?
    4. 4. What is Apache Cassandra? Apache Cassandra™ is a massively scalable NoSQL database. • Continuous availability • High performing writes and reads • Linear scalability • Multi-data center support
    5. 5. Cassandra is Fault Tolerant
        [Diagram: two clients and an eight-node ring with tokens 10, 20, 30, 40, 50, 60, 70, 80]
        Replication Factor = 3
        Node failure or it goes down temporarily: we could still retrieve the data from the other 2 nodes
        Token   Order_id   Qty   Sale
        70      1001       10    100
        44      1002       5     50
        15      1003       30    200
    6. 6. Source: Netflix Tech Blog Netflix Cloud Benchmark… “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput.” Source: Solving Big Data Challenges for Enterprise Application Performance Management benchmark paper presented at the Very Large Database Conference, 2013. End Point Independent NoSQL Benchmark Highest in throughput… Lowest in latency… The NoSQL Performance Leader
    7. 7. Linearly Scalable
        [Diagram: rings of 2, 4, and 8 nodes]
        100,000 txns per sec; 200,000 txns per sec; 400,000 txns per sec
        Simply add nodes to double, quadruple performance and capacity
    8. 8. Multi Data Center Support
        [Diagram: clients connected to two eight-node rings, labeled East Data Center and West Data Center; after the outage only one ring remains]
        Data Center Outage Occurs: No interruption to the business
    9. 9. Built for Modern Online Applications • Architected for today’s needs • Linear scalability at lowest cost • 100% uptime • Operationally simple
    10. 10. Agenda Why Cassandra? • Scale with ease • Always on • Deploy across data centers
    11. 11. Agenda Why Cassandra? • Scale with ease • Always on • Deploy across data centers Why DataStax Enterprise?
    12. 12. DataStax delivers Apache Cassandra to the Enterprise
    13. 13. Why DataStax? DataStax supports both the open source community and modern business enterprises.
        Feature | Open Source | DataStax Enterprise
        Apache Cassandra (Cassandra Chair and 30% of committers) | Community Edition | Enterprise Edition (Tested & Certified for Production)
        OpsCenter | Standard | Enterprise (Alerts, Automated Management Services, Cluster Management)
        DevCenter | Yes | Yes
        Drivers/Connectors | Yes | Yes
        Online Documentation | Yes | Yes
        Online Training | Yes | Yes
        Mailing Lists and Forums | Yes | Yes
        Security | Standard | Enterprise (Kerberos Authentication & SSL Encryption)
        Built-in Real-time Analytics | – | Yes
        Built-in Enterprise Search | – | Yes
        In-Memory Database Option | – | Yes
        Expert Support (24x7x365) | – | Yes
        Consultative Support | – | Yes
        Onsite Training | – | Yes
    14. 14. • Visual browser-based UI • Point-and-click administration • Visual cluster management • Proactive alerts • Built-in external notifications • Visual backup operations DataStax OpsCenter
    15. 15. Cassandra Query Language (CQL) DataStax DevCenter – a free, visual query tool for creating and running CQL statements against Cassandra and DataStax Enterprise.
    16. 16. Cassandra Security
        Internal Authentication: Internal validation of authorized users; Simple to implement & easy to understand; No learning curve
        Object Permission Management: Deep control over who can add/change/delete/read data; Uses familiar GRANT/REVOKE from relational world; No learning curve
        Client to Node Encryption: Ensures data cannot be captured/stolen en route to a server; Data is safe both in flight from/to a database and on the database; Complete coverage is ensured
    17. 17. DataStax Enterprise Security
        External Authentication: External validation of authorized users; Leverages Kerberos & LDAP; Single sign-on to all data domains
        Transparent Data Encryption: Protects sensitive data at rest via SSL; No changes needed at application level; Encrypt both Cassandra and Hadoop data
        Data Auditing: Audit trail of all accesses and changes; Control to audit only what’s needed; Uses log4j interface to ensure performance & efficient audit operations
    18. 18. Built-in Enterprise Search • Delivers Solr integration • Very fast performance • Search indexes span multiple data centers (regular Solr cannot) • Online scalability via adding new nodes • Built-in failover; continuously available [Diagram: four C* & Solr nodes]
    19. 19. Built-In Enterprise Analytics • Real-time analytics on Cassandra hot data • MapReduce, Hive, Pig, Sqoop, and Mahout • No single points of failure. Enterprise Analytics: MapReduce, Hive, Pig, More; Continuous availability; Integrated big data platform [Diagram: four C* & Hadoop nodes]
    20. 20. Agenda Why Cassandra? • Scale with ease • Always on • Deploy across data centers Why DataStax Enterprise? • Enterprise-ready capabilities • 24x7x365 support
    21. 21. Agenda Why Cassandra? • Scale with ease • Always on • Deploy across data centers Why DataStax Enterprise? • Enterprise-ready capabilities • 24x7x365 support How to Evaluate?
    22. 22. Evaluation Process
        Phases: 1) R&D Mode 2) POC Cycle 3) Optimize
        Steps: Download & install binaries or sandbox; Leverage use cases to identify needs; Install DSE/OpsCenter on servers; Design/Modify data model; Implement data model; Load sample data; Stress test servers; Develop application; Add Nodes (C*, SOLR, and/or Hadoop)
    23. 23. A Typical POC Environment • Ideally at least 4 nodes, RF=3 • Hardware per node: • At least 8 cores • At least 16 GB RAM (the more the better) • SSD physically attached • Linux (ideally 3.x for improved buffered cache) • Each environment has its own steps/requirements: • EC2, Rackspace, Google Compute, other cloud providers • In-house servers • In-house servers VM
    24. 24. Tailored to Meet Your Needs
        FREE Resources: DSE Sandbox; DSE for Non-Production; OpsCenter (Standard); DevCenter; DataStax Academy; Community Forums; White Papers & Documentation
        PAID Services: Onsite Consulting; Remote Consulting; Onsite Training; Public Training
        PAID Subscription: Production DSE Pro; Production DSE Standard; Production DSE Max; Non-Production DSE Pro; Non-Production DSE Standard; Non-Production DSE Max
        PAID Bundles: Quick Start Enterprise; Quick Start Standard
        Customer Benefits: Customer Success Manager; Proactive Guidance; Free Health Check; Free Migration Assessment; Monthly Bulletin Best Practices
    25. 25. The Right Mix of Support Resources
        Phases: Education & Training; Planning & Design; Develop & Test; Production Support
        Training: How to use DataStax Enterprise; Learn DataStax admin features; How to use integrated search; How to use integrated analytics
        Consulting: DataStax Enterprise architecture; Data modeling with DataStax; Cluster tuning and performance; Best practices and planning
        Support: Troubleshooting errors; Experiencing unexpected results; Clarification on documentation; Critical issue support
    26. 26. Available Online Resources • Patrick McFadin’s data modeling series • CQL/Data modeling on DataStax • Virtual training • Java driver sample code • SOLR documentation and tutorial on DataStax • Analytics documentation • Github code samples • Advance time series best practices Massively Scale a DB!
    27. 27. Agenda Why Cassandra? • Scale with ease • Always on • Deploy across data centers Why DataStax Enterprise? • Enterprise-ready capabilities • 24x7x365 support How to Evaluate? • Evaluate efficiently
    28. 28. Q&A and Next Steps Want to learn more about the evaluation process? • Contact your account manager or email us at sales@datastax.com Want access to more Cassandra resources? • Visit Planet Cassandra at www.planetcassandra.com
    29. 29. Appendix
    30. 30. EC2 Install Process with Linux AMI’s
        • Read through EC2 production planning: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningEC2_c.html
        • Go for i2.2xlarge to i2.4xlarge
        • Create security group: http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installAMIsecurity.html
        • Pick a reputable, reliable Linux image to start with, preferably an image with the 3.x kernel on it
        • Run through the wizard and start the AMIs up
        • Install the prerequisites: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installJREJNAabout_c.html
        • Install DSE node (depends on OS): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html
        • Follow the "What's next" steps at the bottom of the installation instructions, including configuring the DSE node for multi-DC or single DC (topology should be planned for): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deploySingleDC.html#deploySingleDC or http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deployMultiDC.html#deployMultiDC
        • Follow and set recommended production settings: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
    31. 31. Cassandra Architecture Basics – One Node Organizes Data in Partitions Inserted data is written to a Commit Log As well as a MemTable MemTables are flushed to disk in an SSTable based on size. SSTables are immutable Changes to a partition are written to additional SSTables. Deletes write tombstones Node 1 Row Data Partition Key 75 Row Data Partition Key 9
    32. 32. Background – How Cassandra Stores Data
        Model brought from BigTable*
        Partition key and a lot of cells
        Cell names sorted (UTF8, Int, Timestamp, etc.) • CQL creates timestamp if not specified
        [Diagram: a partition key followed by cells 1 to 2 billion, each with a cell name, cell value, timestamp, and TTL]
        ©2013 DataStax Confidential. Do not distribute without consent.
    33. 33. Cassandra Architecture Basics – Multi Data Center • Nodes can be arranged in multiple data centers • Cassandra replicates data efficiently between remote data centers • Each data center can have a different RF • Use data centers to segment nodes for different query patterns
        [Diagram: Nodes 1 to 5 across two data centers labeled Boston and San Francisco, plus a Real Time Analytics label; Row Data 23 and Row Data 76 each appear on three nodes]
    34. 34. Reading Data
        /* Demonstrate an easy way to query data. */
        try {
            ResultSet result = session.execute(
                "SELECT password from user " +
                "WHERE username = 'user2';");
            if (result.isExhausted()) return;  // row not found
            Row user = result.one();
            System.out.println("Password is: " + user.getString("password"));
        } catch (NoHostAvailableException ex) {
            System.out.println("No Host Available");
        } catch (QueryExecutionException ex) {
            System.out.println("Requested consistency level not met");
        }
    35. 35. Prepared Statements
        PreparedStatement statement = session.prepare(
            "INSERT INTO user (username, password) " +
            "VALUES (?, ?);");
        BoundStatement boundStatement = new BoundStatement(statement);
        try {
            session.execute(boundStatement.bind("user4", "user4password"));
        } catch (NoHostAvailableException ex) {
            System.out.println("Host Not Available");
        } catch (QueryExecutionException ex) {
            System.out.println("Requested consistency level not met");
        } catch (QueryValidationException ex) {
            System.out.println("Syntax error, invalid query, or not authorized");
        }
    36. 36. Query-Driven Data Modeling Start by addressing the queries that you will need to answer • Your data should be able to match it directly Think about: • The actions your application needs to perform • How you want to access the data • What are the use cases? • What does the data look like?
    37. 37. Queries (cont) What are you trying to retrieve • Does it need to be ordered? • Is there any nesting of data? • Do you need to group data? • Do you need to filter data? Does data expire? Does data need to be retrieved in chronological order?
    38. 38. Relational Concept - Denormalization • Combine table columns into a single view • No joins • All in how you set the data for fast reads
        SELECT First, Last, Dept FROM employees WHERE id = '1';
        Employees
        id   First     Last    Dept
        1    Edgar     Codd    Engineering
        2    Raymond   Boyce   Math
    39. 39. Time Series – Patterns • Examples: medical device, energy devices/equipment, financial data • Application for sensors, clickstreams, historical data • Typically very high volume writes required • Usually coupled with need to analyze data or search using real-time analytics • Great fit for DSE Cassandra, SOLR, Analytics Nodes
        Wide row layout: StationID | Timestamp, Value/s | Timestamp, Value/s | 1…N
        FLGAZ101 | 20130611T01:01:01, 74.34 | 20130611T01:01:11, 74.28 | 20130611T01:01:21, 74.41
    40. 40. Hardware • Ideal node: • Processor: CPU 8 cores, • Memory: RAM 16 - 64 GB, with 8 GB of Heap, • Network: at least a Gigabit card, • Disks: lots of small disks using JBOD or basic RAIDs (0 or 10), but prefer SSDs • Exact needs vary by use case • Production planning: • http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/architecture/architecturePlanningHardware_c.html
    41. 41. Cassandra Query Language (CQL) • Very similar to RDBMS SQL syntax • Create objects via DDL (e.g. CREATE…) • Core DML commands supported: INSERT, UPDATE, DELETE • Query data with SELECT • Leverage Java drivers to execute queries via PreparedStatements and ResultSets SELECT * FROM USERS WHERE STATE = ‘TX’;
    42. 42. Cassandra is Durable [Diagram: Client, Memory, Commit Log, Flush to Disk, SSTables] Data is organized into Partitions. Inserted data is written to a Commit Log for a node, as well as a MemTable. MemTables are flushed to disk in an SSTable based on size. SSTables are immutable
    43. 43. Overview of Replication in Cassandra • Replication is controlled by what is called the replication factor. A replication factor of 1 means there is only one copy of a row in a cluster. A replication factor of 2 means there are two copies of a row stored in a cluster • Replication is controlled at the keyspace level in Cassandra Original row Copy of row Replication Factor (RF) determines additional nodes that get a copy of the partition Eg. RF=3 Copy of row
    44. 44. Data Model • The schema used in Cassandra is modeled after Google Bigtable. It is a row-oriented, column structure • A keyspace is akin to a database in the RDBMS world • A column family is similar to an RDBMS table but is more flexible/dynamic • A row in a column family is indexed by its key [Diagram labels: ID, Name, SSN, DOB, Portfolio, Keyspace, Customer, Column Family]
    45. 45. Tunable Data Consistency • Choose between strong and eventual consistency (one to all responding) depending on the need • Can be done on a per-operation basis, and for both reads and writes • Handles multi-data center operations
        Writes: Any, One, Quorum, Local_Quorum, Each_Quorum, All
        Reads: One, Quorum, Local_Quorum, Each_Quorum, All
    46. 46. Thank You
