Jesus Alberto Guzmán Polanco
jguzman@datum.com.gt
Apache Cassandra Certified @Datum
• Cassandra Overview
• Cassandra Architecture
• Data Modeling
• DataStax Enterprise
"Apache Cassandra is an open source, distributed, decentralized, elastically
scalable, highly available, fault-tolerant, tuneably consistent, column-oriented
database that bases its distribution design on Amazon's Dynamo and
its data model on Google's Bigtable. Created at Facebook, it is now used at
some of the most popular sites on the Web."
Cassandra: The Definitive Guide.
BigTable Dynamo
• Must always be available
• 100% uptime
• Must be easy to manage and maintain
• Linear scalability at lowest cost
• Big Data
• Operational (OLTP) Data Store
• Masterless - No single point of failure
• Always on
• Linear scale performance
• Fast response times
• Always on reliability
• Data replication across multiple data centers and the cloud
• Large amounts of structured, semi-structured, and unstructured data
• Designed expecting failure
• Data partitioned among all nodes in the cluster
• Configurable data replication to ensure uptime
• Linear scalability (performance / storage)
• Keyspace
• Identified by name
• Contains tables ("column families")
• Determines replication factor
• Table
• Identified by name
• Has rows
• Row
• Contains columns (up to 2 billion!)
• Can have different number of columns
• Column
• Identified by name
• Has data type
• Node: A single instance of Cassandra
• Rack: A logical grouping of nodes (optional)
• Data Center: A logical grouping of racks or nodes
• Cluster: A logical grouping of data centers (1 to N)
• Primary Key
• Required for each table
• Uniquely identifies row
• Partition Key
• Determines node
• Has one or more columns
• Clustering Key
• Determines disk location (order)
• Has zero or more columns
• Binary search
• Search by: >, >=, <=, <, =
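The clustering key bullets above (ordered storage, binary search, range comparisons) can be illustrated with a small sketch. It assumes a partition is simply a list of rows kept sorted by clustering value; the `Partition` class and its methods are hypothetical names for illustration and are not Cassandra's storage engine.

```python
import bisect

class Partition:
    """Rows of one partition, kept sorted by clustering value (illustrative only)."""

    def __init__(self):
        self._keys = []   # sorted clustering values
        self._rows = []   # rows, in the same order as _keys

    def insert(self, clustering, row):
        i = bisect.bisect_left(self._keys, clustering)
        if i < len(self._keys) and self._keys[i] == clustering:
            self._rows[i] = row            # same primary key: overwrite
        else:
            self._keys.insert(i, clustering)
            self._rows.insert(i, row)

    def range(self, low=None, high=None):
        """Rows with low <= clustering < high, located by binary search."""
        start = 0 if low is None else bisect.bisect_left(self._keys, low)
        end = len(self._keys) if high is None else bisect.bisect_left(self._keys, high)
        return self._rows[start:end]

p = Partition()
for t in ["09:25", "08:12", "12:45", "09:24"]:
    p.insert(t, {"time": t})

# Range query on the clustering value: 09:00 <= time < 12:00
morning = [r["time"] for r in p.range(low="09:00", high="12:00")]
```

Because rows are already in clustering order, a range condition only needs two binary searches and one contiguous slice, which is why range predicates are cheap on clustering columns.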
Three Key concepts
• Partitioning (data distribution)
• Replication (fault tolerance)
• Consistency (performance tunable)
• Partitioner
• Generate tokens
• Data distribution
• Partition keys are hashed into 64-bit tokens
• Murmur3 default
[Figure: ring of four nodes (Node 1 to Node 4) covering the token range -2^63 to +2^63]
• Simplified Token Range: Integers from 0 -> 100
[Figure: ring of four nodes (Node 1 to Node 4) with tokens at 0/100, 25, 50, and 75]
ID   NAME            DOB         HASH  NODE
AB1  John Smith      10/11/1972  17    Node 2
AB2  Bob Jones       3/1/1964    79    Node 1
ZZ3  Mike West       4/22/1968   14    Node 2
WX2  Sally Thompson  10/15/1969  32    Node 3
MNZ  Bill Wright     6/6/1966    51    Node 4
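The simplified token example can be reproduced with a short sketch. It assumes Node 1 to Node 4 hold tokens 0/100, 25, 50, and 75, and that each node owns the range ending at its own token; the hash values are copied from the slide rather than computed with Murmur3, and the `owner` function is a hypothetical helper.

```python
import bisect

# Assumed ring layout for the simplified 0..100 token range.
# Each node owns the range (previous token, own token]; 0 and 100 are the same point.
RING = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]
TOKENS = [token for token, _ in RING]

def owner(token):
    """Return the node owning `token`, wrapping past the last token to the first."""
    i = bisect.bisect_left(TOKENS, token) % len(RING)
    return RING[i][1]

# Hash values as shown on the slide (not real Murmur3 output).
hashes = {"AB1": 17, "AB2": 79, "ZZ3": 14, "WX2": 32, "MNZ": 51}
placement = {row_id: owner(h) for row_id, h in hashes.items()}
```

With these assumptions, `placement` matches the node assignments shown on the slide.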
• Provides fault tolerance
• Provides geographic distribution
• Copies of each partition are distributed to data centers
• Defined at the schema (keyspace) level via the Replication Factor
RF =1 RF = 2 RF = 3
A123 | JOHN SMITH | 11234
A147 | BOB MARTIN | 32235
B212 | JEN JONES | 43323
• Higher replication factor = greater fault tolerance
RF =1 RF = 2 RF = 3
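A rough sketch of why a higher replication factor tolerates more failures. It assumes the simplified four-node ring with tokens 0/100, 25, 50, and 75, and a simple placement rule in which the first replica goes to the token's owner and the remaining replicas go to the next nodes clockwise; rack and data center awareness are not modeled, and the function names are hypothetical.

```python
import bisect

RING = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]
TOKENS = [token for token, _ in RING]

def replicas(token, rf):
    """Owner of the token plus the next rf - 1 nodes clockwise on the ring."""
    if not 1 <= rf <= len(RING):
        raise ValueError("rf must be between 1 and the number of nodes")
    first = bisect.bisect_left(TOKENS, token) % len(RING)
    return [RING[(first + k) % len(RING)][1] for k in range(rf)]

def survives(token, rf, down):
    """True if at least one replica of the token is on a node that is still up."""
    return any(node not in down for node in replicas(token, rf))

rf1 = replicas(17, 1)
rf3 = replicas(17, 3)
```

With RF = 1, losing Node 2 makes token 17 unavailable; with RF = 3 the same data is still held by Node 3 and Node 4.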
• Assign Replication Factor for each Data Center and schema
APP {
Toronto : 3
San Francisco : 3
Dubai : 3
New York : 3
}
[Figure: data centers in San Francisco, New York, Dubai, and Toronto]
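One way to express per-data-center replication factors is the replication map of a `CREATE KEYSPACE` statement with `NetworkTopologyStrategy`. The sketch below only builds the statement text. The keyspace name `app` and the city names are taken from the slide; in a real cluster the data center names must match the names the cluster actually reports, so treat them as placeholders. `create_keyspace_cql` is a hypothetical helper.

```python
def create_keyspace_cql(keyspace, rf_by_dc):
    """Build a CREATE KEYSPACE statement with one replication factor per data center."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += ["'%s': %d" % (dc, rf) for dc, rf in rf_by_dc.items()]
    return "CREATE KEYSPACE %s WITH replication = {%s};" % (keyspace, ", ".join(options))

# Placeholder data center names copied from the slide.
rf_by_dc = {"Toronto": 3, "San Francisco": 3, "Dubai": 3, "New York": 3}

statement = create_keyspace_cql("app", rf_by_dc)
total_copies = sum(rf_by_dc.values())  # copies of each partition across all data centers
```

With three replicas in each of four data centers, every partition is stored twelve times in total.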
• It is the number of replicas that need to respond for a request to be
considered complete (reads and writes/updates)
• Consistency level can be set on every request (normally the default is used)
DC 1 DC 2
Some Consistency Levels
• ANY – writes only; a stored hint is enough
• ONE – one replica must respond
• QUORUM – a majority of replicas must respond (RF / 2 + 1, rounded down)
• LOCAL_QUORUM – a majority of replicas in the local data center must respond
• ALL – all replicas must respond
[Figure: DC 1 and DC 2, each with RF=3]
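The consistency levels above translate into a number of replica responses. A minimal sketch for the two-data-center example with RF=3 each, where a quorum is computed as the integer division of the replica count by two, plus one; the function names and the `DC1`/`DC2` labels are illustrative, and only a subset of levels is covered.

```python
def quorum(replica_count):
    """Smallest majority of `replica_count` replicas: floor(n / 2) + 1."""
    return replica_count // 2 + 1

def required_responses(level, rf_by_dc, local_dc):
    """Replica responses needed for a few consistency levels (illustrative subset)."""
    total = sum(rf_by_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "ALL":
        return total
    raise ValueError("unsupported level: %s" % level)

rf_by_dc = {"DC1": 3, "DC2": 3}
needed = {
    level: required_responses(level, rf_by_dc, "DC1")
    for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "ALL")
}
```

For this layout, QUORUM needs 4 of the 6 replicas, so it must cross data centers, while LOCAL_QUORUM needs only 2 of the 3 local replicas.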
How it works in Cassandra: writing data
[Figure: client writes at consistency level LOCAL_QUORUM; two data centers, each with RF=3]
How it works in Cassandra: reading data
[Figure: client reads at consistency level ONE]
Common:
• One
• Local_Quorum Reads / Writes
• Lightweight Transactions (LWT)
• Application Level Locking (ING*)
DC 1 DC 2
RF=3 RF=3
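A common rule of thumb behind the choices above: a read is guaranteed to reach at least one replica that holds the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor. The sketch below checks that condition for one data center with RF=3; the function name is hypothetical.

```python
def reads_see_latest_write(read_replicas, write_replicas, rf):
    """True when every read set must overlap every write set: R + W > RF."""
    return read_replicas + write_replicas > rf

rf = 3
local_quorum = rf // 2 + 1  # 2 of 3 replicas

quorum_both = reads_see_latest_write(local_quorum, local_quorum, rf)  # quorum reads and writes
one_both = reads_see_latest_write(1, 1, rf)                           # ONE reads and writes
```

Quorum reads with quorum writes satisfy the condition (2 + 2 > 3); ONE with ONE does not (1 + 1 <= 3), which trades consistency for lower latency.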
• Operation = Write/Read
• Hints
• Coordinator stores missed mutations for later replay
• Hints stop being stored once a node has been down for 3 hours (default)
• Read Repair
• Mismatched results at read trigger a repair for that partition
• Read Repair Chance setting triggers validation of all replicas on small
percentage of reads
• Repair
• Process run on Node / Keyspace to true up data
• Can be run automatically via OpsCenter in DSE
• Ensures tombstones are properly evicted during compaction
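The read repair idea above can be sketched as follows: compare what each replica returned, keep the value with the newest write timestamp, and note which replicas need to be updated. This is a simplified illustration; tie-breaking between equal timestamps and the actual repair messages are not modeled, and `read_repair` is a hypothetical helper.

```python
def read_repair(replica_values):
    """Pick the newest (timestamp, value) pair and list the replicas that lack it.

    replica_values maps replica name -> (write_timestamp, value).
    """
    newest = max(replica_values.values(), key=lambda tv: tv[0])
    stale = sorted(name for name, tv in replica_values.items() if tv != newest)
    return newest, stale

responses = {
    "Node 1": (1002, "JOHN SMITH"),
    "Node 2": (1001, "JON SMITH"),   # missed the latest write
    "Node 3": (1002, "JOHN SMITH"),
}
newest, stale = read_repair(responses)
```

Here the client would receive the value written at timestamp 1002, and Node 2 would be sent that value so the replicas converge.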
• Snapshots
• By table, keyspace, node, cluster
• Very fast: a snapshot is a set of hard links to existing SSTable files
• Do you need backups?
• Data replication
• Data across all nodes
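Why snapshots are fast can be shown with plain hard links: linking a file does not copy its data, and the linked copy stays readable after the original name is removed. The sketch below uses temporary directories and a made-up file name; it illustrates the mechanism only and is not how a snapshot would be taken in practice.

```python
import os
import tempfile

def snapshot(data_dir, snapshot_dir):
    """Hard-link every file in data_dir into snapshot_dir; no data is copied."""
    os.makedirs(snapshot_dir, exist_ok=True)
    for name in os.listdir(data_dir):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(snapshot_dir, name))

with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "data")
    os.makedirs(data_dir)
    sstable = os.path.join(data_dir, "example-Data.db")  # hypothetical file name
    with open(sstable, "wb") as f:
        f.write(b"immutable sstable bytes")

    snap_dir = os.path.join(root, "snapshots", "before_upgrade")
    snapshot(data_dir, snap_dir)
    linked = os.path.join(snap_dir, "example-Data.db")

    same_file = os.path.samefile(sstable, linked)  # both names point at the same data
    os.remove(sstable)                             # original name goes away
    still_readable = os.path.exists(linked)        # snapshot copy is unaffected
```

A snapshot lives on the same disk as the data, so it protects against accidental deletion or a bad upgrade but not against losing the disk itself.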
• Cassandra is not an RDBMS
• Distributed changes the rules
• OLTP (not Analytics / Search / ad hoc query)
• Rows are accessed by Partition Key
• De-normalization (No joins)
• Multiple query tables
• Use Solr for Search, Hadoop/Spark for Analytics
• Cassandra Query Language (CQL): a SQL-like query language for
communicating with Cassandra
• cqlsh (the CQL command-line shell)
• No Joins
• JSON support
• Upserts
• TTL
• Timestamps
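Upserts, TTL, and timestamps fit together: a write to an existing primary key replaces the old value, the write with the newest timestamp wins, and a value written with a TTL stops being visible once it expires. The toy model below illustrates that behavior with a single value per key; the `Cell` and `Table` classes are hypothetical, and timestamps and TTLs are plain integers in the same unit.

```python
class Cell:
    def __init__(self, value, timestamp, ttl=None):
        self.value = value
        self.timestamp = timestamp
        self.expires_at = None if ttl is None else timestamp + ttl

class Table:
    """Toy table holding one value per primary key (illustrative only)."""

    def __init__(self):
        self._cells = {}

    def write(self, key, value, timestamp, ttl=None):
        """Insert and update behave the same: the newest timestamp wins."""
        current = self._cells.get(key)
        if current is None or timestamp >= current.timestamp:
            self._cells[key] = Cell(value, timestamp, ttl)

    def read(self, key, now):
        cell = self._cells.get(key)
        if cell is None or (cell.expires_at is not None and now >= cell.expires_at):
            return None
        return cell.value

t = Table()
t.write("A11", "JOHN", timestamp=100)
t.write("A11", "JOHN SMITH", timestamp=200)   # upsert: replaces the existing value
t.write("A11", "LATE ARRIVAL", timestamp=150) # older timestamp: ignored
t.write("S1", "session-token", timestamp=300, ttl=60)

name = t.read("A11", now=250)
before_expiry = t.read("S1", now=359)
after_expiry = t.read("S1", now=360)
```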
• Collections:
• Set
• List
• Map
• User-defined types (UDT)
• Tuples
Track customer transactions by type
Columns: DATE, CUST_ID, TYPE, TIME, CUST NAME, LOCATION, AMOUNT
[Figure: brackets marking the PARTITION KEY and CLUSTERING COLUMNS, which together form the PRIMARY KEY]
Track customer transactions by type
DATE      CUST_ID  TYPE      TIME         CUST NAME   LOCATION  AMOUNT
10/15/14  A11      DEPOSIT   09:24:33.55  JOHN SMITH  30132      252.50
10/15/14  A11      DEPOSIT   09:25:53.21  JOHN SMITH  30132       63.49
10/15/14  A11      WITHDRAW  12:45:22.23  JOHN SMITH  30060     -300.00
10/15/14  B23      DEPOSIT   08:12:22.32  BOB BARKER  94123      500.00
Partition size considerations
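To reason about partition size for the transaction example, it helps to group the sample rows the way the primary key would. The slide text does not state which columns form the partition key, so this sketch assumes (DATE, CUST_ID) as the partition key and (TYPE, TIME) as the clustering columns; `build_partitions` is a hypothetical helper.

```python
from collections import defaultdict

# Sample rows from the slide: date, cust_id, type, time, name, location, amount
ROWS = [
    ("10/15/14", "A11", "DEPOSIT", "09:24:33.55", "JOHN SMITH", "30132", 252.50),
    ("10/15/14", "A11", "DEPOSIT", "09:25:53.21", "JOHN SMITH", "30132", 63.49),
    ("10/15/14", "A11", "WITHDRAW", "12:45:22.23", "JOHN SMITH", "30060", -300.00),
    ("10/15/14", "B23", "DEPOSIT", "08:12:22.32", "BOB BARKER", "94123", 500.00),
]

def build_partitions(rows):
    """Group rows by the assumed partition key and sort by the assumed clustering columns."""
    partitions = defaultdict(list)
    for date, cust_id, tx_type, time, name, location, amount in rows:
        partitions[(date, cust_id)].append((tx_type, time, name, location, amount))
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda r: (r[0], r[1]))  # order by TYPE, then TIME
    return dict(partitions)

partitions = build_partitions(ROWS)
sizes = {key: len(rows) for key, rows in partitions.items()}
```

Under this assumed key, a partition grows by one row per transaction a customer makes in a day, which keeps partitions bounded; a key of CUST_ID alone would let a partition grow without limit.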
• Defines transitions between models
• Query-driven methodology
• Formal analysis and validation
• Defines a scientific approach to data
modeling
• Modeling rules
• Mapping patterns
• Schema optimization techniques
• ER diagram (Chen notation)
• Describes entities, relationships, roles, keys, cardinalities
• What is possible and what is not in existing or future data
Simple Order Management (queries)
• Q1: Customers by Customer ID
• Q2: Customer by email
• Q3: Product by Product ID
• Q4: Product by Name
• Q5: Product By Category
• Q6: Order Details by Order ID
• Q7: Order Details by Customer / Date
• Logical-level shows column names and properties
• Physical-level also shows the column data type
Founded in April 2010
Santa Clara, Austin, New York, London, Paris
[Infographic figures: ~40, 600+, 480+; labels: Employees, Percent, Customers]
• Certified Production Cassandra
• Enterprise Security Options
• Integrated Search
• Integrated Analytics (Spark)
• DSE Graph
• Workload Segregation
• In Memory
• OpsCenter
• Management Services
• MDM: Customer 360, Product Catalog
• Personalization and Recommendation
• Internet of Things and Time Series
• Fraud Detection
• List Management
• Messaging
• Inventory Management
• Authentication
• Visual, browser-based user interface.
• Installation, configuration, and administration tasks
carried out in point-and-click fashion.
• Visually supports DataStax Automatic Management
Services.
Introduction to Apache Cassandra
