Using Cassandra as a distributed logging store for PB-scale data

CASSANDRA for BigData Event Logging
Ramesh Veeramani

PROBLEM SCENARIO
• Storage, retrieval and maintenance of event logs and permission logs suffer from poor performance and scalability issues.
• HBase and Oracle DB are currently used.
• Need for better storage and query capability
• Data are immutable.
• Eventual consistency acceptable.
• New infrastructure must scale to 1 PB of data.

Option: Cassandra
• Why Cassandra ?
• How with Cassandra?
• Internal working
• Setup
• Tools
• Maintenance
• Cassandra constraints
• Comparison

Why Cassandra
• Scales Incrementally
• Highly Available
• Uses a peer-to-peer topology instead of a naive master-slave one.
• Good for OLTP and for data changes.
• Availability is a high priority.
• Write throughput is higher than read throughput.
• The data model is a lot like the relational model, so it is easy to develop against.
• Well established since 2008

Cassandra - Internal
• Tokens and Hashing
• Virtual nodes
• Allow even utilization across all servers when nodes are added or removed.
• Token assignment is automatic.
• Configured in cassandra.yaml [num_tokens: 256]
• Replication
• Gossip / Snitch
• Read and Write ?
• Compaction Strategy.
• Add and Remove nodes
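
A minimal Python sketch of how tokens, hashing and virtual nodes fit together. It is a toy model, not Cassandra's implementation: MD5 stands in for the real partitioner, and the `Ring` class, the node names and the key names are all hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used purely for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring in which every node owns `vnodes` tokens."""

    def __init__(self, vnodes: int = 256):
        self.vnodes = vnodes
        self.tokens = []  # sorted list of token positions
        self.owner = {}   # token position -> node name

    def add_node(self, node: str) -> None:
        # Token assignment is automatic: derive each vnode token from the node name.
        for i in range(self.vnodes):
            t = token(f"{node}-vnode-{i}")
            bisect.insort(self.tokens, t)
            self.owner[t] = node

    def remove_node(self, node: str) -> None:
        self.tokens = [t for t in self.tokens if self.owner[t] != node]
        self.owner = {t: n for t, n in self.owner.items() if n != node}

    def node_for(self, key: str) -> str:
        # First token at or after the key's position, wrapping around the ring.
        idx = bisect.bisect_left(self.tokens, token(key)) % len(self.tokens)
        return self.owner[self.tokens[idx]]


ring = Ring(vnodes=256)
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"event-{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

# Only keys that now belong to the new node change owner.
moved = [k for k in keys if before[k] != after[k]]
print(f"keys moved after adding node-d: {len(moved)} of {len(keys)}")
```

Because each node holds many small token ranges, a new node takes a slice from every existing node rather than splitting a single neighbour's range.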

Constraints with Replication
• Consistency is an issue.
• Aims for eventual consistency.
• Reads can return the latest write when strict consistency is configured.
• Consistency is configurable to different levels.
• ConsistencyLevel = {ALL, QUORUM, ONE}
• The coordinator node is responsible for enforcing replication and the consistency level (CL).
• Stale data is possible in production.
• Operational effort is needed to synchronize the data (nodetool repair on each node).
• Synchronizing the data on each node has to be done on a regular schedule.
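
The trade-off between the consistency levels can be checked with the usual rule of thumb: a read is guaranteed to overlap the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor. The helper names below are hypothetical and the sketch covers only the three levels listed on this slide.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation at a given level."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when the read set and the write set must share at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


for read_cl, write_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    ok = reads_see_latest_write(read_cl, write_cl, replication_factor=3)
    print(f"read={read_cl:<6} write={write_cl:<6} strict={ok}")
```

With a replication factor of 3, ONE/ONE leaves room for stale reads, while QUORUM/QUORUM or ONE/ALL does not.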

Gossip /Snitch
• Cassandra uses a gossip protocol instead of naive ping or another communication protocol.
• Gossip is an epidemic, probabilistic protocol.
• Gossip is not deterministic.
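
A small simulation shows why gossip is called epidemic: information spreads like an infection and reaches every node in a number of rounds that grows slowly with cluster size. This is a generic push-gossip sketch, not Cassandra's actual gossiper; the function name, the fanout of 1 and the fixed random seed are assumptions made for illustration.

```python
import random


def gossip_rounds(num_nodes: int, fanout: int = 1, seed: int = 0) -> int:
    """Rounds until every node has heard a piece of state that starts on node 0."""
    rng = random.Random(seed)  # seeded only so that a run can be repeated
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        newly_informed = set()
        for node in informed:
            # Each informed node tells `fanout` randomly chosen peers.
            peers = [p for p in range(num_nodes) if p != node]
            newly_informed.update(rng.sample(peers, min(fanout, len(peers))))
        informed |= newly_informed
    return rounds


for size in (10, 100, 1000):
    print(f"{size} nodes: fully informed after {gossip_rounds(size)} rounds")
```

Changing the seed changes the exact round count, which is the non-deterministic behaviour the slide refers to.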

Why reads are relatively slower
• Rows and columns have to be retrieved from the datastore.
• If all requested columns are present in the MemTable, the results are returned.
• If the data is not found there, the read passes through the SSTables in order of entry.
• A Bloom filter expedites the search in the SSTables.
• A Bloom filter (BF) is a probabilistic bit-array data structure that tells whether a key is definitely absent from an SSTable or may be present in it.
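
The way a Bloom filter lets a read skip an SSTable can be sketched in a few lines of Python. The class below is a hypothetical, simplified filter (SHA-256 based, one byte per bit); Cassandra's own implementation and sizing differ.

```python
import hashlib


class BloomFilter:
    """Answers 'definitely not present' or 'possibly present' for a key."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
for key in ("event-1", "event-2", "event-3"):
    sstable_filter.add(key)

print(sstable_filter.might_contain("event-2"))  # True: added keys are never missed
```

A read only has to open an SSTable when its filter answers "possibly present", so most SSTables that do not hold the key are skipped without touching disk.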

COMPACTION STRATEGY
• Four compaction strategies
• Size-tiered (STCS)
• Date-tiered (DTCS)
• Time-window (TWCS)
• Leveled (LCS) [** Recommended for read-intensive workloads]
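
As an illustration of the size-based strategy, the sketch below groups SSTables of similar size into buckets and proposes a bucket for compaction once it holds enough tables. It is a simplified model under stated assumptions: the function names are hypothetical, and the thresholds (0.5, 1.5 and 4) are taken as typical defaults rather than read from this cluster's configuration.

```python
def size_tiered_buckets(sstable_sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes so that each bucket holds tables of similar size."""
    buckets = []  # each bucket is a list of sizes
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # A table joins a bucket when it is close to the bucket's average size.
            if average * bucket_low <= size <= average * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sstable_sizes, min_threshold=4):
    """Buckets with at least `min_threshold` tables are merged into one SSTable."""
    return [b for b in size_tiered_buckets(sstable_sizes) if len(b) >= min_threshold]


sizes_mb = [10, 11, 9, 10, 100, 400]
print(size_tiered_buckets(sizes_mb))    # [[9, 10, 10, 11], [100], [400]]
print(compaction_candidates(sizes_mb))  # [[9, 10, 10, 11]]
```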

SETUP
• 2 nodes, each with:
• processor Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
• processor CPU
• memory 8188MiB System Memory
• memory 8188MiB DIMM RAM
• CentOS Linux release 7.5.1804 (Core)
• sudo vim /etc/yum.repos.d/cassandra.repo
[cassandra]
name=Apache Cassandra
baseurl=https://www.apache.org/dist/cassandra/redhat/311x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://www.apache.org/dist/cassandra/KEYS
• sudo yum -y install cassandra
• sudo systemctl start cassandra
• sudo systemctl enable cassandra
• nodetool status
*** Extra configuration is needed to set up a MULTI-NODE CLUSTER

CREATING A KEYSPACE (A.K.A. DB)
CREATE KEYSPACE events WITH
REPLICATION = {'class':'SimpleStrategy','replication_factor':1};

CREATE TABLE
• CREATE TABLE events (
ID UUID,
USER_TYPE text,
ACCOUNT_ID text,
CLASS_NAME text,
CREATE_DATE timestamp,
PRIMARY KEY((ID),ACCOUNT_ID)
);
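
What PRIMARY KEY((ID),ACCOUNT_ID) means for data layout can be modelled in plain Python: ID is the partition key that decides where a row lives, and ACCOUNT_ID is the clustering key that orders rows inside that partition. This is an in-memory analogy only, with hypothetical helper names; it does not talk to a Cassandra cluster.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# partition key (id) -> {clustering key (account_id) -> remaining columns}
table = defaultdict(dict)


def insert(event_id, account_id, user_type, class_name, create_date):
    # Writing the same (id, account_id) again overwrites the row (an upsert).
    table[event_id][account_id] = {
        "user_type": user_type,
        "class_name": class_name,
        "create_date": create_date,
    }


def select_by_id(event_id):
    # Rows inside one partition come back ordered by the clustering key.
    partition = table.get(event_id, {})
    return [(account_id, partition[account_id]) for account_id in sorted(partition)]


now = datetime.now(timezone.utc)
event_id = uuid.uuid4()
insert(event_id, "acct-b", "VIP", "LoginEvent", now)
insert(event_id, "acct-a", "GUEST", "LoginEvent", now)

for account_id, columns in select_by_id(event_id):
    print(account_id, columns["user_type"])
```

A query that filters on anything other than the partition key has no such direct lookup, which is why the index and view slides that follow are needed.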

CREATING SASI (SECONDARY INDEX)
• CREATE CUSTOM INDEX class_user ON events (class_name,user_type)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode':'contains'};
• CREATE CUSTOM INDEX user_index ON events
(user_type)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode':'contains'};

CREATING MATERIALIZED VIEW
CREATE MATERIALIZED VIEW MV AS
SELECT account_id, class_name,id,create_date,user_type from events2
where account_id='MYSPACE' AND user_type='VIP'
PRIMARY KEY ((user_type,class_name),account_id);
CREATE MATERIALIZED VIEW MV AS
SELECT account_id, class_name,id,create_date,user_type from events2 where
account_id='MYSPACE' and user_type is not null and class_name is not null
PRIMARY KEY ((user_type,class_name),account_id);

Benchmarking
• One million records in 444 seconds with a single sequential thread.
• One million records in 240 seconds with 2 concurrent threads, each running sequentially.
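
The two measurements translate into throughput and scaling figures with simple arithmetic. The sketch below uses only the numbers from this slide; the function name is hypothetical.

```python
RECORDS = 1_000_000


def throughput(records: int, seconds: float) -> float:
    """Records processed per second."""
    return records / seconds


single = throughput(RECORDS, 444)  # 1 thread
double = throughput(RECORDS, 240)  # 2 concurrent threads

speedup = 444 / 240       # how much faster the 2-thread run finished
efficiency = speedup / 2  # fraction of ideal linear scaling

print(f"1 thread : {single:,.0f} records/s")
print(f"2 threads: {double:,.0f} records/s")
print(f"speedup {speedup:.2f}x, scaling efficiency {efficiency:.1%}")
```

That works out to roughly 2,252 records/s for one thread and 4,167 records/s for two, a speedup of 1.85x.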

Maintenance tools
• cqlsh – Python-based tool to query Cassandra using CQL (Cassandra's query language)
• cassandra-stress – benchmarking tool
• nodetool – command line administration tool that uses JMX to get
operational information from Cassandra nodes and to kick off
administration tasks (repair, compaction, cleanup)
• DSE OpsCenter – visual monitoring; requires a DataStax Enterprise license (paid monitoring)

Conclusion
• A good candidate for a logging system.
• Easy to scale.
• Native driver for PHP.
• Data model design needs a lot of thought.
• Queries must be known up front, and the data model designed around the application's queries.
