Hadoop+Cassandra_Integration

• Joyabrata Das
Hadoop and Cassandra Integration
An Open Source Approach

Contents
Slide sequence
• Introduction
• Hadoop + Cassandra Integration
• System Description
• Implementation
• Business alternative
• Thank you

Introduction
Hadoop
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models (http://hadoop.apache.org/).
Cassandra
• Apache Cassandra is a distributed storage system for managing very
large amounts of structured data spread out across many commodity
servers, while providing highly available service with no single point
of failure (http://planetcassandra.org/).

Hadoop + Cassandra Integration
Using Cassandra as backend storage instead of the Hadoop file system
• Hadoop consists of a distributed file system (HDFS) at its core and
libraries that support the MapReduce model for writing programs
(also expressible as Hive queries or Pig scripts) that analyze
batch-oriented, passive data.
• Cassandra is not a conventional database; it behaves more like a
Hashtable or HashMap that stores key/value pairs, which allows the fast
reads and writes that are crucial for real-time data handling.
• HDFS is a good solution for providing cost-effective storage
for Hadoop implementations devoted to data warehouse systems, while
using Cassandra as the storage layer makes it possible to run analytics
on Cassandra data that comes from line-of-business applications.
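The combination described above, key/value storage feeding a MapReduce-style job, can be illustrated with a toy sketch in plain Python. It does not use Hadoop or Cassandra; the `rows` dictionary, its field names, and the functions `map_phase`, `reduce_phase`, and `run_job` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical rows keyed by row key, standing in for a Cassandra table.
rows = {
    "order-1": {"region": "east", "amount": 10},
    "order-2": {"region": "west", "amount": 5},
    "order-3": {"region": "east", "amount": 7},
}


def map_phase(key, row):
    """Emit one (region, amount) pair per stored row."""
    yield row["region"], row["amount"]


def reduce_phase(region, amounts):
    """Combine all amounts emitted for a single region."""
    return region, sum(amounts)


def run_job(table):
    """Run map, group by emitted key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for key, row in table.items():
        for out_key, out_value in map_phase(key, row):
            grouped[out_key].append(out_value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Fast lookup by key, as in a HashMap.
single_row = rows["order-2"]

# Batch analysis over every row, as in a MapReduce job.
totals = run_job(rows)
```

The same data serves both access patterns: a direct read by key for real-time use, and a full scan for batch analytics.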

System Description
• A four-node cluster running Cassandra and Hadoop, with Hive and Pig, was set up on
Ubuntu Linux servers.
• All four Cassandra nodes were connected in a ring, providing distributed data
replication with no single point of failure; the cluster was monitored using the
OpsCenter dashboard.
• The Pig partitioner was changed to match the Cassandra partitioner.
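As a rough illustration of why a ring with replication has no single point of failure, and why the client-side partitioner has to agree with the cluster's, here is a simplified Python sketch. It is not Cassandra's actual implementation: the node names, the replication factor of 3, the evenly spaced tokens, and the MD5-based `token_for` function are assumptions made for the example.

```python
import bisect
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical node names
REPLICATION_FACTOR = 3                        # assumed, not from the slides
RING_SIZE = 2 ** 127

# Each node owns one evenly spaced token on the ring (a simplification).
TOKENS = [i * RING_SIZE // len(NODES) for i in range(len(NODES))]


def token_for(key: str) -> int:
    """Map a row key to a position on the ring using an MD5 hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for(key: str) -> list:
    """Return the nodes holding a key: the first node at or after the
    key's token, plus the next nodes clockwise around the ring."""
    start = bisect.bisect_left(TOKENS, token_for(key)) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


def surviving_replicas(key: str, failed_node: str) -> list:
    """Replicas still reachable for a key after one node fails."""
    return [node for node in replicas_for(key) if node != failed_node]


owners = replicas_for("user-42")
```

Because every key lives on several nodes, losing any one node still leaves copies available. The placement depends entirely on `token_for`, so a loader that hashed keys differently from the cluster would look for data on the wrong nodes, which is the reason the partitioners must match.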

Implementation
• Generate sample data to be loaded, as a .csv file
• Create a sample keyspace (schema) and column family (table) using the
Cassandra Query Language (CQL)
• Write a Pig script to format the data file into the required tuples, then start loading
• After successful loading, read the same data back with simple CQL queries and
analyze it (using MapReduce or Pig)
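The first three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not give the actual schema or data, so the keyspace name `demo`, the table `users`, its columns, and the replication settings are hypothetical, and the `to_tuples` function stands in for the formatting that the Pig script performed in the original setup. The CQL statements are held as plain strings and are not executed here.

```python
import csv
import io
import random

# Hypothetical CQL for the sample keyspace and table (not from the slides).
CREATE_KEYSPACE = """
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
"""

CREATE_TABLE = """
CREATE TABLE demo.users (
  user_id int PRIMARY KEY,
  name text,
  age int
);
"""


def generate_csv(n_rows: int, seed: int = 0) -> str:
    """Generate sample CSV text with columns user_id, name, age."""
    rng = random.Random(seed)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for user_id in range(n_rows):
        writer.writerow([user_id, f"user{user_id}", rng.randint(18, 80)])
    return buffer.getvalue()


def to_tuples(csv_text: str) -> list:
    """Parse CSV text into typed (user_id, name, age) tuples ready to load."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(int(user_id), name, int(age)) for user_id, name, age in reader]


sample = generate_csv(5)
records = to_tuples(sample)
```

Once rows like these are loaded, a simple CQL query such as `SELECT * FROM demo.users WHERE user_id = 3;` would read one of them back by key.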

Business alternative
• This open source approach of integrating Hadoop with Cassandra is
also available commercially in DataStax Enterprise.
• The Cassandra File System (CFS) was designed by DataStax to make it easy
to run analytics on Cassandra data. It is implemented as part of DataStax
Enterprise, which combines Apache Cassandra, Hadoop, and Solr™ into
a unified big data platform; CFS provides the storage foundation that
makes running Hadoop-style analytics on Cassandra data hassle-free.
