CHAPTER 06: Cassandra
History of Cassandra
• Apache Cassandra was born at Facebook for inbox
search. Facebook open sourced the code in 2008.
• Cassandra became an Apache Incubator project
in 2009 and subsequently became a top-level
Apache project in 2010.
• At the time of writing, the latest version of Apache
Cassandra was 3.1.1.
• It is a column-oriented database designed around
symmetric, peer-to-peer nodes rather than a
master-slave architecture.
• It builds on ideas from Amazon’s Dynamo and Google’s
Bigtable:
cassandra ~= bigtable + dynamo
What is Cassandra?
• Apache Cassandra is a highly scalable, high-performance
distributed database designed to handle large amounts of
structured data across many commodity servers with
replication, providing high availability and no single point
of failure.
• In the ring-architecture diagram, circles are Cassandra
nodes and the lines between the circles show the
distributed topology; a client can send data to any node.
(Ring Architecture)
Notable points
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and
its data model on Google’s Bigtable.
• Cassandra implements a Dynamo-style replication model
with no single point of failure, but adds a more powerful
“column family” data model.
• Cassandra is used by some of the biggest
companies, such as Facebook, Twitter, Cisco, Rackspace,
eBay, Adobe, Netflix, and more.
Features of Cassandra
• Elastic scalability - Cassandra is highly scalable; it allows
you to add more hardware to accommodate more customers
and more data as required.
• Massively Scalable Architecture: Cassandra has a
masterless design in which all nodes are peers, which
provides operational simplicity and easy scale-out.
• Always-on architecture (peer-to-peer
network): Cassandra replicates data across nodes,
which ensures no single point of failure and keeps it
continuously available for business-critical applications.
• Linear Scale Performance: As more nodes are added,
Cassandra's throughput increases, so it
maintains quick response times.
Features of Cassandra
• Flexible data storage - Cassandra accommodates all possible
data formats including: structured, semi-structured, and
unstructured. It can dynamically accommodate changes to
data structures according to the need.
• Easy data distribution - Cassandra provides the flexibility to
distribute data where you need by replicating data across
multiple data centers.
• Transaction support - Cassandra offers Atomicity,
Isolation, and Durability at the row level, plus lightweight
transactions, though it is not a fully ACID-compliant
database in the relational sense.
• Fast writes - Cassandra was designed to run on cheap
commodity hardware. It performs blazingly fast writes and
can store hundreds of terabytes of data, without sacrificing
the read efficiency.
Features of Cassandra
• Fault Detection and Recovery: Failed nodes can easily be
restored and recovered.
• Flexible and Dynamic Data Model: Supports a rich set of
datatypes with fast writes and reads.
• Data Protection: Data is protected by the commit-log
design and built-in security features such as backup and
restore mechanisms.
• Tunable Data Consistency: Support for tunable (including
strong) data consistency across the distributed architecture.
• Multi Data Center Replication: Cassandra provides
built-in replication of data across multiple data centers.
Features of Cassandra
• Data Compression: Cassandra can compress data by up
to 80% with little overhead.
• Cassandra Query Language (CQL): Cassandra provides a
query language similar to SQL, which makes it
easy for developers to move from relational
databases to Cassandra.
Cassandra Use Cases/Application
• Messaging: Cassandra is a great database for
companies that provide mobile-phone and messaging
services; these companies handle huge amounts of data,
so Cassandra fits them well.
• Internet of Things applications: Cassandra is a great
database for applications where data arrives at
very high speed from many devices or sensors.
• Product Catalogs and retail apps: Cassandra is used by
many retailers for durable shopping cart protection and
fast product catalog input and output.
Cassandra Use Cases/Application
• Social media analytics and recommendation engines:
Cassandra is a great database for many online companies
and social media providers that analyze user activity and
generate recommendations for their customers.
Cassandra Architecture
• The design goal of Cassandra is to handle big data
workloads across multiple nodes without any single
point of failure.
• Cassandra has a peer-to-peer distributed architecture:
data is distributed among all the nodes in a
cluster.
Components of Cassandra
• Node − The basic, fundamental unit of
Cassandra: a single machine (computer/server)
where data is stored.
• Data center − It is a collection of related
nodes.
• Rack − A unit that contains multiple
servers, all stacked on top of one
another. A node is a single server in a rack.
• Cluster − A cluster is a component that
contains one or more data centers.
Components of Cassandra
• Commit log − The commit log is a crash-recovery
mechanism in Cassandra. Every write operation is
written to the commit log.
• Mem-table − A mem-table is a memory-resident
data structure. After the commit log, the data is
written to the mem-table.
• SSTable − It is a disk file to which the data is
flushed from the mem-table when its contents
reach a threshold value.
A rack is a group of machines housed in the same physical
box. Each machine in the rack has its own CPU, memory,
and hard disk; the rack itself has no CPU, memory, or
hard disk of its own.
• All machines in the rack are
connected to the rack's network switch.
• The rack's network switch is
connected to the cluster.
• All machines in the rack share a
common power supply. Note that a rack
can fail for two reasons: a
network switch failure or a power
supply failure.
• If a rack fails, none of the
machines in it can be
accessed, so all the nodes on the rack
appear to be down.
Cassandra Cluster
Cassandra Architecture
• All the nodes in a cluster play the same role. Each node is
independent and at the same time interconnected to other
nodes.
• Each node in a cluster can accept read and write requests,
regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served
from other nodes in the network.
Data Replication in Cassandra
• In Cassandra, one or more of the nodes in a
cluster act as replicas for a given piece of data.
• If it is detected that some of the nodes
responded with an out-of-date value,
Cassandra will return the most recent value to
the client. After returning the most recent
value, Cassandra performs a read repair in the
background to update the stale (old) values.
• The replication factor (RF) lies between 1 and n (the number of nodes).
Gossip protocol
• Cassandra uses the Gossip Protocol in the
background to allow the nodes to communicate with
each other and detect any faulty nodes in the
cluster.
• A gossip protocol is a style of computer-to-
computer communication protocol inspired by the
form of gossip seen in social networks.
• The term epidemic protocol is sometimes used as a
synonym for a gossip protocol, because gossip
spreads information in a manner similar to the
spread of a virus in a biological community.
Partitioner
• Used for distributing data on the various nodes in
a cluster.
• It also determines the node on which to place the
very first copy of the data.
• It is a hash function that assigns each data item a
token, which determines its placement in the ring.
Replication Factor
• The total number of replicas across the cluster is
referred to as the replication factor.
• The RF determines the number of copies of data
(replicas) that will be stored across nodes in a
cluster.
• A replication strategy determines the nodes
where replicas are placed.
– Simple Strategy
– Network Topology Strategy
Simple Strategy
• Use it only for a single data center and one rack.
• Simple Strategy places the first replica on a node
determined by the partitioner. Additional replicas
are placed on the next nodes clockwise in the
ring.
• Simple Strategy is a rack-unaware and data-
center-unaware policy, i.e., it places replicas without
considering topology (rack or datacenter location).
Network Topology Strategy
• Network Topology Strategy is used when you have (or
plan to have) more than one data center.
• As the name indicates, this strategy is aware of the
network topology (location of nodes in racks, data
centers, etc.) and is more intelligent than Simple
Strategy.
• This strategy specifies how many replicas you want in
each datacenter.
• Replica counts are set for each data center separately.
Within each data center, replicas are placed in a
clockwise direction, preferring different racks of the
same data center, until the required number of replicas
is reached.
Anti-Entropy
• Anti-entropy is a process of comparing the data of
all replicas and updating each replica to the
newest version.
• Frequent data deletions and node failures are
common causes of data inconsistency.
• Anti-entropy node repairs (run with the nodetool
repair utility) are important for every Cassandra cluster.
• Anti-entropy repair is used for routine
maintenance and when a cluster needs fixing.
Write path in Cassandra
• Cassandra processes data at several stages on the write path,
starting with the immediate logging of a write and ending in
compaction:
– Logging data in the commit log
– Writing data to the memtable
– Flushing data from the memtable
– Storing data on disk in SSTables
– Compaction
Hinted Handoffs
• When a replica node is down at the time of a write, the
coordinator node stores a "hint" locally and hands the
write off to the replica once it comes back online.
Hint table
A hint stores:
• The location of the node on which the replica is to be
placed.
• Version metadata.
• The actual data.
• When node C recovers and is functional again,
node A reacts to the hint by forwarding the data to node
C.
Tunable Consistency (TC)
• Consistency refers to how up-to-date and synchronized a
row of Cassandra data is on all of its replicas.
• Tunable consistency = strong consistency + eventual consistency
• Strong Consistency:
– Each update propagates to all locations, and it
ensures every server has a copy of the data
before it is served to the client.
– This has a performance cost.
Eventual Consistency
• It implies that the client receives a success acknowledgment
as soon as part of the cluster acknowledges the write.
• It is used when application performance matters.
Read consistency
• It specifies how many replicas must respond before
the result is sent to the client application.
• Consistency levels are listed below.
ONE − Returns a response from the closest replica node
holding the data.
QUORUM − Returns a result from a quorum of replicas,
using the most recent timestamp for the data.
LOCAL_QUORUM − Returns a result from a quorum of
replicas in the same data center as the coordinator node,
using the most recent timestamp for the data.
EACH_QUORUM − Returns a result from a quorum of
replicas in each data center, using the most recent
timestamp.
ALL − Provides the highest level of consistency of all
levels. It responds to a read request only after all the
replica nodes have responded.
Write consistency
• It specifies on how many replicas a write must succeed
before an ACK is sent to the client application.
• The write consistency levels mirror the read levels above
(ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL), plus ANY,
which lets a write succeed on a hinted handoff alone.
CQL DATA TYPES
• CQL provides built-in data types such as int, bigint,
float, double, text/varchar, boolean, timestamp, and uuid,
plus the collection types set, list, and map used in the
examples below.
CQLSH
• Cassandra provides Cassandra query language
shell (cqlsh) that allows users to communicate with
Cassandra.
• Using cqlsh, you can:
– define a schema,
– insert data, and
– execute a query.
KEYSPACES (Database [Namespace])
• It is a container that holds application data, analogous
to a database (schema) in an RDBMS.
• Used to group column families together.
• A cluster typically has one keyspace per
application.
• A keyspace (or key space) in a NoSQL data store is an
object that holds together all column families of a
design.
• It is the outermost grouping of the data in the data
store.
To create a keyspace
CREATE KEYSPACE "KeySpace Name"
WITH replication = {'class': 'Strategy name',
'replication_factor': 'No. of replicas'};
Details about existing Keyspaces
Describe keyspaces;
Select * from system.schema_keyspaces;
This gives more details
To use an existing keyspace
USE keyspace_name;
USE students;
To create a column family or table by the name
“student_info”.
CREATE TABLE Student_Info ( RollNo int PRIMARY
KEY, StudName text, DateofJoining timestamp,
LastExamPercent double);
Other commands
Describe tables;
Describe table student_info;
CRUD
SELECT
To view the data from the table “student_info”.
SELECT * FROM student_info;
Select * from student_info where rollno in (1,2,3);
Index
To create an index on the “studname” column of the
“student_info” column family, use the following
statement:
CREATE INDEX ON student_info(studname);
Select * from student_info where StudName='Aviral';
Update
To update the value held in the “StudName” column of
the “student_info” column family to “Sharad” for the
record where the RollNo column has value = 3.
Note: An update sets one or more column values for a
given row in a Cassandra table. It does not return
anything.
• UPDATE student_info SET StudName = 'Sharad' WHERE
RollNo = 3;
Delete
To delete the column “LastExamPercent” from the
“student_info” table for
the record where RollNo = 2.
Note: A DELETE statement removes one or more columns
from one or more rows of a Cassandra table, or
removes entire rows if no columns are specified.
DELETE LastExamPercent FROM student_info WHERE
RollNo=2;
Collections
• Cassandra provides collection types, used to group and
store data together in a column.
• E.g., grouping a user's multiple email addresses.
• The value of each item in a collection is limited to
64 KB.
• Collections can be used when you need to store the
following: Phone numbers of users and Email ids of
users.
Collections Set
• To alter the schema for the table “student_info” to
add a column “hobbies”:
ALTER TABLE student_info ADD hobbies set<text>;
UPDATE student_info SET hobbies = hobbies + {'Chess',
'Table Tennis'} WHERE RollNo=4;
Collections List
• To alter the schema of the table “student_info” to
add a list column “language”:
ALTER TABLE student_info ADD language list<text>;
UPDATE student_info SET language = language + ['Hindi',
'English'] WHERE RollNo=1;
Collections Map
• A map relates one item to another with a key-value pair.
Using the map type, you can store timestamp-related
information in user profiles.
• To alter the “Student_info” table to add a map
column “todo”:
• ALTER TABLE Student_info ADD todo map<timestamp,
text>;
Example
UPDATE student_info SET todo = { '2014-09-24':
'Cassandra Session', '2014-10-02 12:00' :
'MongoDB Session' } where rollno = 1;
Time To Live(TTL)
• Data in a column, other than a counter column, can
have an optional expiration period called TTL (time to
live).
• The client request may specify a TTL value for the
data. The TTL is specified in seconds.
Time To Live(TTL)
• CREATE TABLE userlogin(userid int primary key,
password text);
• INSERT INTO userlogin (userid, password) VALUES
(1,'infy') USING TTL 30;
• select * from userlogin;
Export to CSV
copy student_info( RollNo, StudName,
DateofJoining, LastExamPercent) TO 'd:\student.csv';
Import data from a CSV file
CREATE TABLE student_data ( id int PRIMARY KEY, fn text, ln
text,phone text, city text);
COPY student_data (id,fn,ln,phone,city) FROM
'd:\cassandraData\student.csv';
Introduction to MapReduce Programming
(Revisit for details)
• In MapReduce Programming, Jobs (Applications) are
split into a set of map tasks and reduce tasks. Then these
tasks are executed in a distributed fashion on Hadoop
cluster.
• Each task processes the small subset of data that has
been assigned to it. This way, Hadoop distributes the load
across the cluster.
• A MapReduce job takes a set of files stored in
HDFS (Hadoop Distributed File System) as input.
Mapper
• The Map task takes care of loading, parsing,
transforming, and filtering.
• A mapper maps the input key-value pairs into a set of
intermediate key-value pairs.
• Maps are individual tasks that have the responsibility of
transforming input records into intermediate key-value
pairs. Each map task is broken into the following phases:
• RecordReader
• Mapper/Maps
• Combiner
• Partitioner
RecordReader
• RecordReader reads the data from an input split, record
by record, and converts it into key-value pairs for input
to the Mapper class.
Maps
• Map is a user-defined function, which takes a series of
key-value pairs and processes each one of them to
generate zero or more key-value pairs.
• Map takes a set of data and converts it into another set
of data. Input and output are key-value pairs.
Combiner
• A combiner is a type of local reducer that groups similar
data from the map phase into a new set of key-value pairs.
• It is not a part of the main MapReduce algorithm;
• it is optional (it runs as part of the map task).
• The main function of a Combiner is to summarize the
map output records with the same key.
Difference between Combiner and Reducer
• Output generated by the combiner is intermediate data
and is passed to the reducer.
• Output of the reducer is written to the output file on
disk.
Partitioner
• A partitioner partitions the key-value pairs of
intermediate Map-outputs.
• The Partitioner in MapReduce controls the partitioning
of the key of the intermediate mapper output.
• The partition phase takes place after the Map phase and
before the Reduce phase.
• The number of partitions is equal to the number of
reducers: a partitioner divides the data
according to the number of reducers, so the
data from a single partition is processed by a
single reducer.
Partitioner
• A partitioner is needed only when there are multiple
reducers.
Shuffling and Sorting in Hadoop MapReduce
• The process by which the intermediate output
from mappers is transferred to the reducer is called
Shuffling.
• Intermediate key-value pairs generated by the mapper are
sorted automatically by key.
Reduce
• The primary task of the Reducer is to reduce
a set of intermediate values (the ones that share
a common key) to a smaller set of values.
• The Reducer takes the grouped key-value paired
data as input and runs a Reducer function on each
one of them.
• Here, the data can be aggregated, filtered, and
combined in a number of ways, which can require a
wide range of processing.
• The output of the reducer is the final output,
which is stored in HDFS.
RecordWriter (Output format)
• RecordWriter writes output key-value pairs from the
Reducer phase to output files.
• OutputFormat instances provided by Hadoop are
used to write files in HDFS. Thus the final output of the
reducer is written to HDFS by an OutputFormat instance
using a RecordWriter.