Big Data and its emergence

 Kalpesh Pradhan (@kalpeshpradhan)
 Sr. Developer, Hungama Digital Media Pvt Ltd.
 Designed a solution for migrating a SQL database to NoSQL
 Contributed to implementing a search engine using Apache Cassandra and Solr
 Designed a solution for bringing social analytics in-house using Apache Cassandra.
 Big data means a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data-processing applications.
 In a survey by Gartner, the limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.
 1 exabyte = 1,048,576 terabytes
 Science
◦ Meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.
 Technology
◦ Internet search, social networks, server logs, user-action tracking on websites.
 Other Sources
◦ Stock markets, e-commerce transactions.
 Big data is changing the way people within organizations work together. It is creating a culture in which business and IT leaders must join forces to realize value from all data.
 Insights from big data can enable organizations to
◦ Make better decisions
◦ Deepen customer engagement
◦ Optimize operations
◦ Prevent threats and fraud
• Data is emerging as the world's newest resource for competitive advantage.
• Analysis of data gives an organization a competitive edge in shaping its strategy.
• Example: presenting a website according to user history.
• Big data empowers an organization to make certain decisions in a smarter way.
• Example: Jubilant FoodWorks
 Collects user information and orders
 Analyzes the data
 Predicts when a particular user is likely to come back to order
 Predicts what a user may order on the basis of past orders.
 Equips the call-center agent with data that can help the customer order.
• Data is always collected as raw information.
• The challenge is to derive value from the collected data.
• Computing relevance from the collected data is a challenge.
Example: targeting a customer with a new credit-card scheme based on transaction history.
 NoSQL ("Not only SQL")
 Apache Hadoop
 Apache Cassandra
 MapReduce
 Apache HBase
 Apache Hive
 Apache Pig (Pig Latin)
 Yahoo!
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is used in every Yahoo! Web search query.
 Facebook
In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 petabytes of storage. On June 13, 2012, they announced the data had grown to 100 petabytes, and on November 8, 2012, they announced that the data gathered in the warehouse grows by roughly half a petabyte per day.
 1 petabyte = 1,048,576 gigabytes
 Free and open source
 NoSQL
 Distributed database system
 Manages large amounts of structured, semi-structured, and unstructured data
 Scales to a very large size across many commodity servers with no single point of failure
 Allows maximum flexibility and performance at scale
 Memory
◦ A minimum of 8GB of RAM is needed
◦ Recommended: 16GB – 32GB
◦ Java heap space should be set to a maximum of 8GB or half of your total RAM, whichever is lower (a larger heap means more intense garbage-collection periods); a sketch of this rule follows below.
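A minimal sketch of the heap-sizing rule in Python (the function name and sample RAM values are illustrative; the 8GB cap and half-of-RAM rule come from the slide):

def recommended_max_heap_gb(total_ram_gb):
    """Heap cap per the rule above: min(8 GB, half of total RAM)."""
    return min(8, total_ram_gb / 2)

# e.g. a 16 GB node gets the full 8 GB heap; a 12 GB node gets 6 GB
print(recommended_max_heap_gb(16))  # 8
print(recommended_max_heap_gb(12))  # 6.0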
 CPU
◦ Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound.
◦ For dedicated hardware, 8-core processors are the current price-performance sweet spot.
 Disk
◦ Ideally, Cassandra should have at least two disks:
 one for the commit log and
 one for the data directories.
 At a minimum, the commit log should be on its own partition.
◦ Commit log disk
 Does not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O).
 Disk
◦ Most workloads are best served by less expensive SATA disks, scaling disk capacity and I/O by adding more nodes (with more RAM).
◦ Use one or more disks and make sure they are large enough for the data volume and fast enough both to satisfy reads that are not cached in memory and to keep up with compaction.
 Number of Nodes
◦ Using a greater number of smaller nodes is better than using fewer larger nodes because of potential bottlenecks on larger nodes during compaction.
 Network
◦ Choose reliable, redundant network interfaces and make sure that your network can handle traffic between nodes without bottlenecks.
 Recommended bandwidth is 1000 Mbit/s (Gigabit) or greater.
 Bind the Thrift interface (listen address) to a specific NIC (Network Interface Card).
 Network Ports
◦ OpsCenter-specific
 50031: OpsCenter HTTP proxy for JobTracker
 61620: OpsCenter intra-node monitoring port
 61621: OpsCenter agent port
◦ Intra-node
 1024+
◦ Public
 22: SSH
 8888: OpsCenter
 7000: Cassandra intra-node port
 9160: Cassandra client (Thrift) port
 Calculating Data Size
◦ As with all data storage systems, the size of your raw data will be larger once it is loaded into Cassandra due to storage overhead.
◦ On average, raw data will be about two times larger on disk after it is loaded into the database, but it could be much smaller or larger depending on the characteristics of your data and column families.
 Column Overhead - Every column incurs 15 bytes of overhead. Since each row in a column family can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, add an additional 8 bytes (23 bytes of column overhead). So the total size of a regular column is (a sketch follows the formula):
◦ total_column_size = column_name_size + column_value_size + 15
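A minimal sketch of the per-column calculation (the function name and the 10-/100-byte sample sizes are illustrative; the 15- and 23-byte constants come from the slide):

def column_size_bytes(name_size, value_size, counter_or_expiring=False):
    """On-disk column size: name + value + 15 bytes of overhead,
    or 23 bytes for counter and expiring columns."""
    overhead = 23 if counter_or_expiring else 15
    return name_size + value_size + overhead

# e.g. a 10-byte column name with a 100-byte value
print(column_size_bytes(10, 100))        # 125 (regular column)
print(column_size_bytes(10, 100, True))  # 133 (counter/expiring column)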
 Row Overhead - Just like columns, every row incurs some overhead when stored on disk. Every row in Cassandra incurs 23 bytes of overhead.
 Primary Key Index - Every column family also maintains a primary index of its row keys. Primary index overhead becomes more significant when you have lots of skinny rows. The size of the primary row key index can be estimated as follows, in bytes (a sketch follows the formula):
◦ primary_key_index = number_of_rows * (32 + average_key_size)
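The same estimate as a small Python sketch (names and sample values are illustrative; the 32-byte per-key overhead comes from the slide):

def primary_key_index_bytes(number_of_rows, average_key_size):
    """Primary index estimate: 32 bytes of overhead per row key."""
    return number_of_rows * (32 + average_key_size)

# e.g. one million rows with 20-byte keys -> ~52 MB of index
print(primary_key_index_bytes(1_000_000, 20))  # 52000000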
 Replication Overhead - The replication factor obviously plays a role in how much disk capacity is used. For a replication factor of 1, there is no overhead for replicas (as only one copy of your data is stored in the cluster). If the replication factor is greater than 1, then your total data storage requirement will include replication overhead (a combined sizing sketch follows the formula):
◦ replication_overhead = total_data_size * (replication_factor - 1)
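Putting the sizing formulas together, a rough end-to-end sketch (all parameter values are illustrative; the 23-byte row overhead, 32-byte index overhead, and replication formula come from the preceding slides):

def total_cluster_size_bytes(rows, cols_per_row, avg_column_size,
                             avg_key_size, replication_factor):
    """Rough on-disk estimate: column data + 23 bytes row overhead
    + primary key index, then scaled by the replication factor."""
    data = rows * (cols_per_row * avg_column_size + 23)
    index = rows * (32 + avg_key_size)
    one_copy = data + index
    # total = one_copy + replication_overhead, where
    # replication_overhead = one_copy * (replication_factor - 1)
    return one_copy * replication_factor

# e.g. 1M rows x 10 columns of ~125 bytes each, 20-byte keys, RF = 3
print(total_cluster_size_bytes(1_000_000, 10, 125, 20, 3))  # 3975000000, ~3.97 GB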
 Java Prerequisites
◦ Before installing Cassandra on Linux, Windows, or Mac, ensure that you have the most up-to-date version of Java installed on your machine.
 Download the "DataStax Community Edition Server", which is a bundle containing the most up-to-date version of Cassandra along with all the utilities and tools we will need. You can also download it directly from a terminal window, using wget on Linux or curl on Mac, with the following URL (example commands follow):
http://downloads.datastax.com/community/dsc.tar.gz
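For example, the terminal download might look like this (the extraction step and local file name are assumptions about the archive):

# Linux
wget http://downloads.datastax.com/community/dsc.tar.gz
# Mac
curl -O http://downloads.datastax.com/community/dsc.tar.gz
# unpack the bundle
tar -xzf dsc.tar.gz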
◦ A KeySpace is the container for application data, similar to a database or schema in a relational database.
◦ Inside the keyspace are one or more column family objects, which are analogous to tables. Column families contain columns, and a set of related columns is identified by an application-supplied row key. Rows in a column family are not required to have the same set of columns; the sketch below illustrates this model.
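As an illustration, a minimal sketch using the pycassa Python client (the keyspace and column family names are illustrative; it assumes a node listening on the Thrift client port 9160 with the 'demo' keyspace and 'users' column family already created):

import pycassa

# connect to the 'demo' keyspace over Thrift (port 9160)
pool = pycassa.ConnectionPool('demo', server_list=['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'users')  # column family ~ table

# rows are identified by an application-supplied row key,
# and each row may carry a different set of columns
users.insert('kalpesh', {'name': 'Kalpesh', 'city': 'Mumbai'})
users.insert('guest42', {'last_seen': '2013-01-15'})

print(users.get('kalpesh'))  # OrderedDict([('city', 'Mumbai'), ('name', 'Kalpesh')])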
 Cassandra does not enforce relationships between column families the way that relational databases do between tables.
 There are no formal foreign keys in Cassandra, and joining column families at query time is not supported.
 Each column family has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from the application.
http://en.wikipedia.org/wiki/Apache_Cassandra
http://www.datastax.com/docs/1.0/index
http://www.cloudera.com