Cassandra Admin:
CFS Data Model:
Two column families represent the two primary HDFS services. The HDFS NameNode service, which
tracks each file's metadata and block locations, is replaced with the “inode” column family. The HDFS
DataNode service, which stores file blocks, is replaced with the “sblocks” column family.
HDFS        CFS
NameNode    “inode” column family
DataNode    “sblocks” column family
CFS Write Path:
Hadoop block: seen by Hadoop as a single block, so there is no change in the job-split logic of
MapReduce.
Data: split into sub-blocks, because CFS relies on Thrift, which does not support streaming.
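The split into sub-blocks can be sketched in Python. This is only an illustration under stated assumptions: the function name split_into_subblocks and the tiny sizes are hypothetical and are not CFS parameters.

```python
from typing import List


def split_into_subblocks(block: bytes, subblock_size: int) -> List[bytes]:
    """Cut one Hadoop-sized block into fixed-size sub-blocks (the last may be shorter)."""
    if subblock_size <= 0:
        raise ValueError("subblock_size must be positive")
    return [block[i:i + subblock_size] for i in range(0, len(block), subblock_size)]


# Hypothetical sizes, chosen only to keep the example small.
block = bytes(range(10))
subblocks = split_into_subblocks(block, subblock_size=4)

# Hadoop still sees one block: joining the sub-blocks gives back the original bytes.
assert b"".join(subblocks) == block
print([len(s) for s in subblocks])  # [4, 4, 2]
```

Each sub-block is small enough to be sent in one non-streaming call, while the block boundary that MapReduce uses for job splits stays where it was.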
Sys schema (like Oracle): keyspaces in Cassandra
The ‘inode’ column family contains meta information.
CFS Read Path:
When a read comes in for a file or part of a file (let’s assume Hadoop has already looked up the uuid
from the secondary index), CFS reads the inode info and finds the block and sub-block to read. CFS
then executes a custom Thrift call that returns either the specified sub-block data or, if the call was
made on a node that holds the data locally, the file and offset information of the Cassandra SSTable
file containing the sub-block. It does this because, during a MapReduce task, the JobTracker tries to
place each computation on the node that holds the actual data. Using the SSTable information is
much faster, since the mapper can access the data directly without needing to serialize and
deserialize it via Thrift.
The second column family, ‘sblocks’, stores the actual contents of the file.
What is the replication factor for the nodes that we have set here? (No. of nodes)
What is the replication strategy?
• Cassandra workloads: a Cassandra real-time application needs very rapid access to
Cassandra data. The real-time application accesses data directly by key, large sequential
blocks, or sequential slices.
KeySpace Configuration:
[default@unknown] CREATE KEYSPACE test
WITH placement_strategy = 'NetworkTopologyStrategy'
AND strategy_options={us-east:6,us-west:3};
Workload segregation
Nodes in separate data centers run a mix of:
• Real-time queries (Cassandra and no other services)
• Analytics (either DSE Hadoop, Spark, or dual mode DSE Hadoop/Spark)
• Solr
• External Hadoop system (BYOH)
Schema in Cassandra 1.1
When Cassandra was first released, it followed Google Bigtable. Column families grouping
related columns needed to be defined up front, but column names were just byte arrays interpreted
by the application. It would be fair to characterize this early Cassandra data model as “schemaless.”
CREATE TABLE users (
id uuid PRIMARY KEY,
name varchar,
state varchar
);
ALTER TABLE users ADD birth_date INT;
(Using UUIDs as a surrogate key is common in Cassandra, so that you don’t need to
worry about sequence or autoincrement synchronization across multiple machines.)
Traditional storage engines allocate room for each column in each row: in a static-column storage
engine, each row must reserve space for every column.
In Cassandra’s storage engine, each row is sparse: only the columns present in a given row are stored.
CQL (the Cassandra Query Language) supports defining column families with compound primary
keys. The first column in a compound key definition continues to be used as the partition key, and
the remaining columns are automatically clustered: that is, all the rows sharing a given partition key
will be sorted by the remaining components of the primary key.
sblocks table in the CassandraFS data model
CREATE TABLE sblocks (
block_id uuid,
subblock_id uuid,
data blob,
PRIMARY KEY (block_id, subblock_id)
)
WITH COMPACT STORAGE;
The first element of the primary key, block_id, is the partition key, which means that all subblocks
of a given block will be routed to the same replicas.
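A small Python sketch of this grouping and ordering, as an illustration only. The row values and the helper group_by_partition are hypothetical, and short strings stand in for uuids.

```python
from collections import defaultdict

# Hypothetical (block_id, subblock_id, data) rows; short strings stand in for uuids.
rows = [
    ("block-b", "sub-2", b"..."),
    ("block-a", "sub-3", b"..."),
    ("block-a", "sub-1", b"..."),
    ("block-b", "sub-1", b"..."),
    ("block-a", "sub-2", b"..."),
]


def group_by_partition(rows):
    """Group rows by partition key and sort each group by clustering key."""
    partitions = defaultdict(list)
    for block_id, subblock_id, data in rows:
        partitions[block_id].append((subblock_id, data))
    return {key: sorted(group) for key, group in partitions.items()}


partitions = group_by_partition(rows)
print([sub for sub, _ in partitions["block-a"]])  # ['sub-1', 'sub-2', 'sub-3']
```

All sub-blocks of block-a end up in one partition, already ordered by subblock_id, which is what lets a reader fetch a block's sub-blocks from the same replicas in order.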
Logical representation of the denormalized timeline rows
The physical layout of this data looks like this to Cassandra’s storage engine:
Physical representation of the denormalized timeline rows, WITH COMPACT
STORAGE
Replication strategy: SimpleStrategy / NetworkTopologyStrategy
SimpleStrategy: places the first replica on a node determined by the partitioner.
Additional replicas are placed on the next nodes clockwise in the ring without considering rack or
data center location. Example (figure): 3 replicas in a ring of four nodes {A, B, C, D}.
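The clockwise walk can be sketched in Python as follows. This is a simplified model, not Cassandra's implementation: the function simple_strategy_replicas, the integer tokens, and the four-node ring are assumptions made for the example.

```python
from bisect import bisect_left


def simple_strategy_replicas(ring, key_token, replication_factor):
    """ring: list of (token, node) sorted by token. Returns the replica nodes.

    The first replica is the first node whose token is >= key_token (wrapping
    around the ring); the rest are the next nodes clockwise, ignoring rack and
    data center.
    """
    tokens = [token for token, _ in ring]
    start = bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node ring with evenly spaced tokens.
ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]
print(simple_strategy_replicas(ring, key_token=30, replication_factor=3))  # ['C', 'D', 'A']
```

With 3 replicas on four nodes, every key is stored on all nodes except one, and which node is left out depends only on where the key's token falls on the ring.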
NetworkTopologyStrategy: use when the cluster is deployed across multiple data centers. When
deciding how many replicas to configure in each data center, the two primary considerations are
(1) being able to satisfy reads locally, without incurring cross-datacenter latency, and (2) failure
scenarios.
Asymmetrical replication groupings are also possible. For example, you can have three replicas
per data center to serve real-time application requests and use a single replica for running analytics.
NetworkTopologyStrategy determines replica placement independently within each data center
as follows:
• The first replica is placed according to the partitioner (same as with SimpleStrategy).
• Additional replicas are placed by walking the ring clockwise until a node in a different rack is
found. If no such node exists, additional replicas are placed in different nodes in the same rack.
• NetworkTopologyStrategy attempts to place replicas on distinct racks because
nodes in the same rack (or similar physical grouping) can fail at the same time
due to power, cooling, or network issues.
• [Figure: NetworkTopologyStrategy placing replicas across two data centers with a
total replication factor of 4.] When using NetworkTopologyStrategy, you set the
number of replicas per data center.
• [Figure: tokens assigned to alternating racks.] For more information, see
Calculating Tokens for a Multiple Data Center Cluster.
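The placement rules listed above can be sketched in Python. This is a simplified model under stated assumptions, not Cassandra's actual code: the function nts_replicas, the node names, the tokens, and the rack layout are all hypothetical.

```python
from bisect import bisect_left


def nts_replicas(ring, key_token, replicas_per_dc):
    """ring: list of (token, node, dc, rack) sorted by token.
    replicas_per_dc: for example {"DC1": 2, "DC2": 2}. Returns {dc: [nodes]}.
    """
    tokens = [entry[0] for entry in ring]
    start = bisect_left(tokens, key_token) % len(ring)
    walk = [ring[(start + i) % len(ring)] for i in range(len(ring))]  # clockwise order

    placement = {}
    for dc, wanted in replicas_per_dc.items():
        chosen, seen_racks, skipped = [], set(), []
        for _, node, node_dc, rack in walk:
            if node_dc != dc or len(chosen) == wanted:
                continue
            if rack not in seen_racks:
                chosen.append(node)
                seen_racks.add(rack)
            else:
                skipped.append(node)  # same rack as an earlier replica
        # Not enough distinct racks: fall back to nodes in already-used racks.
        chosen += skipped[: wanted - len(chosen)]
        placement[dc] = chosen
    return placement


# Hypothetical ring: two data centers, two racks each, tokens alternating between racks.
ring = [
    (0, "n1", "DC1", "RAC1"), (10, "n5", "DC2", "RAC1"),
    (25, "n2", "DC1", "RAC2"), (35, "n6", "DC2", "RAC2"),
    (50, "n3", "DC1", "RAC1"), (60, "n7", "DC2", "RAC1"),
    (75, "n4", "DC1", "RAC2"), (85, "n8", "DC2", "RAC2"),
]
print(nts_replicas(ring, key_token=20, replicas_per_dc={"DC1": 2, "DC2": 2}))
# {'DC1': ['n2', 'n3'], 'DC2': ['n6', 'n7']}
```

Each data center is handled on its own, so the total replication factor here is 2 + 2 = 4, and within each data center the two replicas land on different racks.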
Snitches
A snitch maps IP addresses to racks and data centers. It defines how the nodes are grouped together
within the overall network topology. Cassandra uses this information to route inter-node requests
as efficiently as possible.
With 3 replicas, a consistency level of ONE means that 2 of the 3 replicas could miss the write
if they happened to be down at the time the request was made.
If a replica misses a write, the row will be made consistent later via one of Cassandra's built-in
repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
Cassandra's Built-in Consistency Repair Features
Read Repair:
To ensure that frequently read data remains consistent, the coordinator compares, in the
background, the data from all the remaining replicas that own the row. If they are inconsistent, it
issues writes to the out-of-date replicas to update the row to reflect the most recently written values.
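A toy Python model of this comparison, for illustration only. It assumes each replica returns a (value, write_timestamp) pair; the function read_repair and the sample data are hypothetical.

```python
def read_repair(replica_values):
    """replica_values: {replica_name: (value, write_timestamp)}.

    Returns (newest_value, names_of_repaired_replicas) and updates the stale
    replicas in place.
    """
    newest = max(replica_values.values(), key=lambda vt: vt[1])
    stale = sorted(name for name, vt in replica_values.items() if vt[1] < newest[1])
    for name in stale:
        replica_values[name] = newest  # write the most recent value back
    return newest[0], stale


replicas = {"r1": ("72F", 100), "r2": ("70F", 90), "r3": ("72F", 100)}
value, repaired = read_repair(replicas)
print(value, repaired)  # 72F ['r2']
```

The replica with the older timestamp is overwritten with the most recently written value, so later reads agree across all three replicas.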
Anti-Entropy Node Repair:
For data that is not read frequently, or to update data on a node that has been down for a while,
the nodetool repair process ensures that all data on a replica is made consistent.
Hinted Handoff:
If a node happens to be down at the time of a write, its corresponding replicas will save hints about
the missed writes, and then hand off the affected rows once the node comes back online.
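Hinted handoff can be illustrated with a toy Python model. The Cluster class is hypothetical and heavily simplified: it assumes every node is a replica for every key and keeps all hints in one list.

```python
class Cluster:
    """Toy model: every node is a replica for every key (hypothetical simplification)."""

    def __init__(self, nodes):
        self.data = {n: {} for n in nodes}  # node -> {key: value}
        self.up = {n: True for n in nodes}
        self.hints = []                     # (target_node, key, value) kept by live replicas

    def write(self, key, value):
        for node in self.data:
            if self.up[node]:
                self.data[node][key] = value
            else:
                self.hints.append((node, key, value))  # remember the missed write

    def bring_up(self, node):
        self.up[node] = True
        remaining = []
        for target, key, value in self.hints:
            if target == node:
                self.data[node][key] = value           # hand off the hinted write
            else:
                remaining.append((target, key, value))
        self.hints = remaining


cluster = Cluster(["A", "B", "C"])
cluster.up["C"] = False       # node C is down during the write
cluster.write("k", "v1")
cluster.bring_up("C")         # hints are replayed when C returns
print(cluster.data["C"])      # {'k': 'v1'}
```

Node C never saw the original write, yet it holds the value after coming back online because the stored hint was replayed to it.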
Keyspaces: a keyspace is a container for column families; a cluster typically has one keyspace per application.
CREATE KEYSPACE keyspace_name WITH
strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor='2';
Single device per row - Time Series Pattern 1
Partitioning to limit row size - Time Series Pattern 2
The solution is to use a pattern called row partitioning: add data to the row key to limit the
number of columns you get per device.
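A minimal Python sketch of building such a row key. The choice of a one-day bucket and the key format are assumptions made for illustration; the function partitioned_row_key is hypothetical.

```python
from datetime import datetime


def partitioned_row_key(device_id: str, event_time: datetime) -> str:
    """Build a row key of the form '<device_id>:<YYYY-MM-DD>' so each row holds one day."""
    return f"{device_id}:{event_time:%Y-%m-%d}"


print(partitioned_row_key("1234ABCD", datetime(2013, 4, 3, 7, 3)))  # 1234ABCD:2013-04-03
```

Readings from the same device on different days land in different rows, so no single row grows without bound.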
Reverse order timeseries with expiring columns - Time Series Pattern 3
Consider data for a dashboard application where we only want to show the last 10 temperature
readings. With a TTL (time to live) on each data value, this is possible.
CREATE TABLE latest_temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time),
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES
('1234ABCD','2013-04-03 07:03:00','72F') USING TTL 20;
RDBMS                                  Cassandra
Stop service                           sudo service dse stop
Like a commit, done before             nodetool drain -h <host name>
stopping: drain the node so that
no data is lost; Cassandra then
need not read the commit log.
Sys schema (like Oracle)               Keyspaces in Cassandra:
                                       SELECT * FROM system.schema_keyspaces;
Counter Columns
A counter is a special kind of column used to store a number that incrementally counts the occurrences of
a particular event or process. For example, you might use a counter column to count the number of times
a page is viewed.

Cassandra no sql ecosystem
