BIG DATA and Hadoop
By
Chandra Sekhar
Contents

Introduction to BigData

What is Hadoop?

What Hadoop is and is not used for

Top level Hadoop Projects

Differences between RDBMS and HBase.

Facebook server model.
BigData – The Data Age

Big data is a collection of datasets so large and
complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.

The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.

The data generated by different companies has inherent
value and can be used for various analytics and
prediction use cases.
A new approach

Per Moore's Law, which has held true for the past 40 years:
1) Processing power doubles every two years
2) Processing speed is no longer the problem

Getting the data to the processors becomes the bottleneck:
transferring 100 GB of data takes about 22 minutes at a disk
transfer rate of 75 MB/sec.
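
A quick sanity check of that figure (treating 1 GB as 1,000 MB for
round numbers):

    100 GB ÷ 75 MB/sec = 100,000 MB ÷ 75 MB/sec ≈ 1,333 sec ≈ 22 minutes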

So, the new approach is to move processing to the data side in a
distributed way, while satisfying requirements such as data
recoverability, component recovery, consistency, reliability,
and scalability.

The answer is Google's File System (GFS) and MapReduce,
which live on in Hadoop as HDFS and MapReduce.
What Hadoop is used for

Hadoop is recommended to coexist with your RDBMS, serving as a
data warehouse.

It is not a replacement for any RDBMS.

Processing TBs or PBs of data can take hours with traditional
methods; with Hadoop and its ecosystem it takes only minutes,
thanks to the power of distribution.

Many related tools integrate with Hadoop –

Data"analysis”

Data"visualization"

Database"integration"

Workflow"management"

Cluster"management"
Hadoop and EcoSystem

➲ Distributed file system and parallel processing for large-scale
data operations using HDFS and MapReduce.
➲ Plus the infrastructure needed to make them work, including:

Filesystem utilities

Job scheduling and monitoring

Web UI

Many other projects run around the core components of Hadoop
(Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.); together they
are called the Ecosystem.

A set of machines running HDFS and MapReduce is known
as a Hadoop cluster.

Individual machines are known as nodes. A cluster can
have as few as one node or as many as several thousand;
it is horizontally scalable.

More nodes = better performance!
Hadoop Components

HDFS and MapReduce – core

ZooKeeper – administration and coordination

Hive, Pig – SQL and scripting layers built on MapReduce

HBase – NoSQL datastore

Sqoop – imports data into and exports data out of an RDBMS

Avro – serialization based on JSON; used for metadata storage
Hadoop Components: HDFS

HDFS, the Hadoop Distributed File System, is responsible for
storing data on the cluster. It runs on top of a native file
system such as ext3, ext4, or XFS.

HDFS is a file system designed for storing very large files with
streaming data access (write once, read many times), running on
clusters of commodity hardware.

Data is split into blocks and distributed across multiple nodes
in the cluster

Each block is typically 64 MB or 128 MB in size

Each block is replicated multiple times

Default is to replicate each block three times

Replicas are stored on different nodes

This ensures both reliability and availability.
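
As a back-of-the-envelope illustration of how splitting and
replication play out, here is a minimal sketch in plain Java
(the file size is hypothetical) that computes how many blocks a
file occupies and the raw storage its three replicas consume:

    public class BlockMath {
        public static void main(String[] args) {
            long fileSize = 1_000L * 1024 * 1024;   // hypothetical 1,000 MB file
            long blockSize = 128L * 1024 * 1024;    // 128 MB HDFS block size
            int replication = 3;                    // HDFS default replication factor

            // Number of blocks: full blocks plus one partial block if needed
            long blocks = (fileSize + blockSize - 1) / blockSize;

            // Raw storage consumed across the cluster, counting every replica
            // (a partial block only occupies its actual data size)
            long rawBytes = fileSize * replication;

            System.out.printf("blocks=%d, raw storage=%.1f MB%n",
                    blocks, rawBytes / (1024.0 * 1024.0));
            // A 1,000 MB file => 8 blocks (7 full + 1 partial), 3,000 MB raw storage
        }
    }
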
HDFS and MapReduce

NameNode (master)

SecondaryNameNode

Standby/failover NameNode

DataNodes (slave nodes)

JobTracker

Jobs

TaskTracker

Tasks

Mapper

Reducer

Combiner

Partitioner
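
To make the Mapper, Reducer, and Combiner roles above concrete,
here is a minimal sketch of the classic word-count job, assuming
Hadoop's org.apache.hadoop.mapreduce API; the input and output
paths are supplied on the command line:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in each input line
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sums the counts for each word; also usable as a Combiner
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // local pre-aggregation
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
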
HDFS layout, cluster architecture, and MapReduce data flow (diagram slides)
HDFS Access

• WebHDFS – REST API
• FUSE-DFS – mounts HDFS as a normal drive
• Direct access – via the HDFS client API
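
A minimal sketch of the direct-access option using the Java
FileSystem API; the NameNode URI and file path here are
hypothetical (normally they come from core-site.xml):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), conf);
            Path file = new Path("/data/sample.txt"); // hypothetical HDFS path
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // stream the file line by line
                }
            }
        }
    }
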
Hive and Pig

Hive provides a powerful SQL-like language (HiveQL);
though not fully SQL-compliant, it can be used to
perform joins over datasets in HDFS.

It is used for large batch programming; under the
hood, Hive simply runs MapReduce jobs.

Pig is a powerful scripting layer built on top of
MapReduce; its language is called Pig Latin.
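
As an illustration, here is a hedged sketch of running a HiveQL
join from Java via HiveServer2's JDBC driver; the host,
credentials, and table names (users, pageviews) are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJoin {
        public static void main(String[] args) throws Exception {
            // Hive's JDBC driver for HiveServer2
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // HiveQL join over two datasets stored in HDFS; Hive compiles
                // this into MapReduce jobs behind the scenes
                ResultSet rs = stmt.executeQuery(
                        "SELECT u.name, COUNT(*) AS visits " +
                        "FROM users u JOIN pageviews p ON u.id = p.user_id " +
                        "GROUP BY u.name");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
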
HBASE

One of the most powerful NoSQL databases available.

Supports master active-active setups and is based
on Google's BigTable.

Supports columns and column families; its data model
can hold many billions of rows and many millions of
columns.

An excellent architectural masterpiece as far as
scalability is concerned.

A NoSQL database that supports atomic row-level
operations and very fast reads/writes, typically
millions of queries per second.
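
A minimal sketch of writing and reading a cell through the HBase
Java client API; the table name and column family are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row key "row1", family "info", qualifier "name"
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);

                // Read the same cell back
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"),
                                               Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }
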
HBASE – Continued

HBase Master

RegionServers

ZooKeeper quorum

HDFS
ZooKeeper, Mahout

ZooKeeper is a distributed coordinator; it can also be
used as an independent package for managing any set of
distributed servers.

Mahout is a machine-learning library useful for various
data-science techniques, e.g. data clustering,
classification, and recommender systems, using
supervised and unsupervised learning.
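
A minimal sketch of ZooKeeper's coordination primitives using its
Java client; the ensemble address and znode paths are hypothetical.
An ephemeral znode vanishes when its session ends, which is the
building block for liveness tracking and leader election:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkEphemeral {
        public static void main(String[] args) throws Exception {
            // Hypothetical ensemble address; 3-second session timeout
            ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event ->
                    System.out.println("event: " + event.getState()));

            // Parent must exist before creating children
            if (zk.exists("/workers", false) == null) {
                zk.create("/workers", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Ephemeral znode: removed automatically when this session ends
            zk.create("/workers/worker-1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            System.out.println("children: " + zk.getChildren("/workers", false));
            zk.close();
        }
    }
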
Flume

Flume is a real-time data-ingestion mechanism that
writes into a data store.

Flume can move large volumes of streaming data into
HDFS, where it can be used for further analysis.

Apart from this, near-real-time analysis of web-log
data is also possible with Flume.

For example, the logs of a group of web servers can
be written to HDFS using Flume.
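
A minimal sketch of a Flume agent configuration that tails a
web-server log and writes it to HDFS; the agent name, log path,
and HDFS path are hypothetical:

    # Name the agent's source, channel, and sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Tail a web-server access log as the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/httpd/access_log
    a1.sources.r1.channels = c1

    # Buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Write the events into HDFS for later analysis
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/web/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel = c1
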
Sqoop and Oozie

Sqoop is a data import/export mechanism between an
RDBMS and HDFS or Hive.

Many free connectors have been built by various
vendors for different RDBMSs; these make data
transfer very fast, since Sqoop moves data in
parallel.
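
For illustration, a hedged example of a parallel Sqoop import
from MySQL into HDFS; the host, database, table, and credentials
are hypothetical:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password '...' \
      --table orders \
      --target-dir /warehouse/orders \
      --num-mappers 4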

Oozie is a workflow mechanism for executing a large
sequence of MapReduce, Hive, Pig, and HBase jobs, as
well as other Java programs. Oozie also provides an
email action that can send notifications when a
workflow succeeds or fails.
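
A minimal sketch of an Oozie workflow definition with a single
MapReduce action; the workflow name and input/output paths are
hypothetical:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="mr-node"/>
      <action name="mr-node">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <!-- Hypothetical HDFS input/output paths for the job -->
            <property>
              <name>mapred.input.dir</name>
              <value>/data/input</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/data/output</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>MapReduce action failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>
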
RDBMS vs HBASE
A typical RDBMS scaling story runs this way:

Initial Public Launch

Service becomes popular; too many reads hitting the database.

Service continues to grow in popularity; too many writes hitting
the database.

New features increase query complexity; now we have too
many joins

Rising popularity swamps the server; things are too slow

Some queries are still too slow

Reads are OK, but writes are getting slower and slower
With HBase
Enter HBase, which has the following characteristics:

No real indexes.

Automatic partitioning/Sharding

Scale linearly and automatically with new nodes

Commodity hardware

Fault tolerance

Batch processing
Facebook Server Architecture (diagram slide)
