BIG DATA and Hadoop
By
Chandra Sekhar
Contents

Introduction to BigData

What is Hadoop?

What Hadoop is and is not used for

Top level Hadoop Projects

Differences between RDBMS and HBase.

Facebook server model.
BigData – The Data Age

Big data is a collection of datasets so large and
complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.

The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.

The data generated by different companies has inherent
value and can be used for various analytics and
prediction use cases.
A new approach

Per Moore's Law, which has held true for the past 40 years:
1) Processing power doubles every two years
2) Processing speed is no longer the problem

Getting the data to the processors becomes the bottleneck:
transferring 100 GB of data takes about 22 minutes at a disk
transfer rate of 75 MB/sec.
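
A quick sanity check of that figure (treating 1 GB as 1,000 MB for
round numbers):

    100 GB ÷ 75 MB/sec = 100,000 MB ÷ 75 MB/sec ≈ 1,333 sec ≈ 22 minutes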

So, the new approach is to move processing to the data side in a
distributed way, while satisfying requirements such as data
recoverability, component recovery, consistency, reliability,
and scalability.

The answer is Google's File System (GFS) and MapReduce,
which live on in Hadoop as HDFS and MapReduce.
What Hadoop is used for

Hadoop is recommended to coexist with your RDBMS, serving as a
data warehouse.

It is not a replacement for any RDBMS.

Processing TBs or PBs of data can take hours with traditional
methods; with Hadoop and its ecosystem it takes only minutes,
thanks to the power of distribution.

Many related tools integrate with Hadoop –

Data"analysis”

Data"visualization"

Database"integration"

Workflow"management"

Cluster"management"
Hadoop and EcoSystem

➲ Distributed file system and parallel processing for large-scale
data operations using HDFS and MapReduce.
➲ Plus the infrastructure needed to make them work, including:

Filesystem utilities

Job scheduling and monitoring

Web UI

Many other projects run around the core components of Hadoop
(Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.); together they
are called the Ecosystem.

A set of machines running HDFS and MapReduce is known
as a Hadoop cluster.

Individual machines are known as nodes. A cluster can
have as few as one node or as many as several thousand;
it is horizontally scalable.

More nodes = better performance!
Hadoop Components

HDFS and MapReduce – core

ZooKeeper – administration and coordination

Hive, Pig – SQL and scripting layers built on MapReduce

HBase – NoSQL datastore

Sqoop – imports data into and exports data out of an RDBMS

Avro – serialization based on JSON; used for metadata storage
Hadoop Components: HDFS

HDFS, the Hadoop Distributed File System, is responsible for
storing data on the cluster. It runs on top of a native file
system such as ext3, ext4, or XFS.

HDFS is a file system designed for storing very large files with
streaming data access (write once, read many times), running on
clusters of commodity hardware.

Data is split into blocks and distributed across multiple nodes
in the cluster

Each block is typically 64 MB or 128 MB in size

Each block is replicated multiple times

Default is to replicate each block three times

Replicas are stored on different nodes

This ensures both reliability and availability.
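
As a back-of-the-envelope illustration of how splitting and
replication play out, here is a minimal sketch in plain Java
(the file size is hypothetical) that computes how many blocks a
file occupies and the raw storage its three replicas consume:

    public class BlockMath {
        public static void main(String[] args) {
            long fileSize = 1_000L * 1024 * 1024;   // hypothetical 1,000 MB file
            long blockSize = 128L * 1024 * 1024;    // 128 MB HDFS block size
            int replication = 3;                    // HDFS default replication factor

            // Number of blocks: full blocks plus one partial block if needed
            long blocks = (fileSize + blockSize - 1) / blockSize;

            // Raw storage consumed across the cluster, counting every replica
            // (a partial block only occupies its actual data size)
            long rawBytes = fileSize * replication;

            System.out.printf("blocks=%d, raw storage=%.1f MB%n",
                    blocks, rawBytes / (1024.0 * 1024.0));
            // A 1,000 MB file => 8 blocks (7 full + 1 partial), 3,000 MB raw storage
        }
    }
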
HDFS and MapReduce

NameNode (master)

SecondaryNameNode

Standby/failover NameNode

DataNodes (slave nodes)

JobTracker

Jobs

TaskTracker

Tasks

Mapper

Reducer

Combiner

Partitioner
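
To make the Mapper, Reducer, and Combiner roles above concrete,
here is a minimal sketch of the classic word-count job, assuming
Hadoop's org.apache.hadoop.mapreduce API; the input and output
paths are supplied on the command line:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in each input line
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sums the counts for each word; also usable as a Combiner
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // local pre-aggregation
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
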
HDFS layout, cluster architecture, and MapReduce data flow (diagram slides)
HDFS Access

• WebHDFS – REST API
• FUSE-DFS – mounts HDFS as a normal drive
• Direct access – via the HDFS client API
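
A minimal sketch of the direct-access option using the Java
FileSystem API; the NameNode URI and file path here are
hypothetical (normally they come from core-site.xml):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), conf);
            Path file = new Path("/data/sample.txt"); // hypothetical HDFS path
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // stream the file line by line
                }
            }
        }
    }
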
Hive and Pig

Hive provides a powerful SQL-like language (HiveQL);
though not fully SQL-compliant, it can be used to
perform joins over datasets in HDFS.

It is used for large batch programming; under the
hood, Hive simply runs MapReduce jobs.

Pig is a powerful scripting layer built on top of
MapReduce; its language is called Pig Latin.
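
As an illustration, here is a hedged sketch of running a HiveQL
join from Java via HiveServer2's JDBC driver; the host,
credentials, and table names (users, pageviews) are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJoin {
        public static void main(String[] args) throws Exception {
            // Hive's JDBC driver for HiveServer2
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // HiveQL join over two datasets stored in HDFS; Hive compiles
                // this into MapReduce jobs behind the scenes
                ResultSet rs = stmt.executeQuery(
                        "SELECT u.name, COUNT(*) AS visits " +
                        "FROM users u JOIN pageviews p ON u.id = p.user_id " +
                        "GROUP BY u.name");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
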
HBASE

One of the most powerful NoSQL databases available.

Supports master active-active setups and is based
on Google's BigTable.

Supports columns and column families; its data model
can hold many billions of rows and many millions of
columns.

An excellent architectural masterpiece as far as
scalability is concerned.

A NoSQL database that supports atomic row-level
operations and very fast reads/writes, typically
millions of queries per second.
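
A minimal sketch of writing and reading a cell through the HBase
Java client API; the table name and column family are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row key "row1", family "info", qualifier "name"
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);

                // Read the same cell back
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"),
                                               Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }
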
HBASE – Continued

HBase Master

RegionServers

ZooKeeper quorum

HDFS
ZooKeeper, Mahout

ZooKeeper is a distributed coordinator; it can also be
used as an independent package for managing any set of
distributed servers.

Mahout is a machine-learning library useful for various
data-science techniques, e.g. data clustering,
classification, and recommender systems, using
supervised and unsupervised learning.
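
A minimal sketch of ZooKeeper's coordination primitives using its
Java client; the ensemble address and znode paths are hypothetical.
An ephemeral znode vanishes when its session ends, which is the
building block for liveness tracking and leader election:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkEphemeral {
        public static void main(String[] args) throws Exception {
            // Hypothetical ensemble address; 3-second session timeout
            ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event ->
                    System.out.println("event: " + event.getState()));

            // Parent must exist before creating children
            if (zk.exists("/workers", false) == null) {
                zk.create("/workers", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Ephemeral znode: removed automatically when this session ends
            zk.create("/workers/worker-1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            System.out.println("children: " + zk.getChildren("/workers", false));
            zk.close();
        }
    }
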
Flume

Flume is a real-time data-ingestion mechanism that
writes into a data store.

Flume can move large volumes of streaming data into
HDFS, where it can be used for further analysis.

Apart from this, near-real-time analysis of web-log
data is also possible with Flume.

For example, the logs of a group of web servers can
be written to HDFS using Flume.
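
A minimal sketch of a Flume agent configuration that tails a
web-server log and writes it to HDFS; the agent name, log path,
and HDFS path are hypothetical:

    # Name the agent's source, channel, and sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Tail a web-server access log as the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/httpd/access_log
    a1.sources.r1.channels = c1

    # Buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Write the events into HDFS for later analysis
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/web/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel = c1
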
Sqoop and Oozie

Sqoop is a data import/export mechanism between an
RDBMS and HDFS or Hive.

Many free connectors have been built by various
vendors for different RDBMSs; these make data
transfer very fast, since Sqoop moves data in
parallel.
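
For illustration, a hedged example of a parallel Sqoop import
from MySQL into HDFS; the host, database, table, and credentials
are hypothetical:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password '...' \
      --table orders \
      --target-dir /warehouse/orders \
      --num-mappers 4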

Oozie is a workflow mechanism for executing a large
sequence of MapReduce, Hive, Pig, and HBase jobs, as
well as other Java programs. Oozie also provides an
email action that can send notifications when a
workflow succeeds or fails.
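
A minimal sketch of an Oozie workflow definition with a single
MapReduce action; the workflow name and input/output paths are
hypothetical:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="mr-node"/>
      <action name="mr-node">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <!-- Hypothetical HDFS input/output paths for the job -->
            <property>
              <name>mapred.input.dir</name>
              <value>/data/input</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/data/output</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>MapReduce action failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>
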
RDBMS vs HBASE
A typical RDBMS scaling story runs this way:

Initial Public Launch

Service becomes popular; too many reads hitting the database.

Service continues to grow in popularity; too many writes hitting
the database.

New features increase query complexity; now we have too
many joins

Rising popularity swamps the server; things are too slow

Some queries are still too slow

Reads are OK, but writes are getting slower and slower
With HBase
Enter HBase, which has the following characteristics:

No real indexes.

Automatic partitioning/Sharding

Scale linearly and automatically with new nodes

Commodity hardware

Fault tolerance

Batch processing
Facebook Server Architecture (diagram slide)
