2. TABLE OF CONTENTS
INTRODUCTION OF BIG DATA
INTRODUCTION OF HADOOP
MAP REDUCE
HDFS
HBASE
HIVE
SQOOP
APPLICATION OF BIG DATA & HADOOP
ADVANTAGES & LIMITATIONS
PLATFORM USES
3. BIG DATA
“90% of the world’s data was generated in the last few years.”
Big Data is a collection of large datasets (Terabyte or Petabyte) that cannot be
processed using traditional computing techniques. It is not a single technique
or a tool; rather, it spans many areas of business and technology.
Thus Big Data is characterized by huge volume, high velocity, and an
extensible variety of data. The data in it can be of three types:
Structured data: relational data.
Semi-structured data: XML data.
Unstructured data: Word, PDF, text, media logs.
4. HADOOP
Hadoop is an open-source Apache framework, written in Java, that allows
distributed storage and processing of large datasets across clusters of
computers using simple programming models.
Hadoop Architecture
At its core, Hadoop has two major layers namely:
(a) Processing/Computation layer (MapReduce), and
(b) Storage layer (Hadoop Distributed File System).
6. MAP REDUCE
MapReduce is a parallel programming model for writing
distributed applications, devised at Google for efficient
processing of large amounts of data (multi-terabyte datasets) on
large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner. MapReduce programs run on
Hadoop, an Apache open-source framework.
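The map, shuffle, and reduce phases described above can be sketched in plain Python. This is a toy word count, not Hadoop's actual Java API; all function names and inputs here are illustrative:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key (the framework does this)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "input splits", as if the file were divided across mappers.
splits = ["big data needs hadoop", "hadoop processes big data"]
pairs = [p for s in splits for p in map_phase(s)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["hadoop"])  # 2 2
```

In a real Hadoop job, each phase runs on different nodes of the cluster and the shuffle moves data over the network, which is what makes the model scale.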
7. HDFS
The Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS) and provides a distributed file system that is designed to run on
commodity hardware. It has many similarities with existing distributed file
systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-
cost hardware. It provides high-throughput access to application data and is
suitable for applications with large datasets.
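A rough illustration of the design just described: HDFS splits each file into fixed-size blocks and replicates every block across DataNodes. The sketch below assumes the common defaults (128 MB blocks, replication factor 3) rather than values read from a real cluster:

```python
# Assumed defaults, not queried from any real cluster.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB block size
REPLICATION = 3                  # replication factor

def hdfs_blocks(file_size_bytes):
    """Return (number of blocks, total bytes stored after replication)."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    return num_blocks, file_size_bytes * REPLICATION

# A 300 MB file becomes three blocks (128 MB + 128 MB + 44 MB),
# and replication triples the raw storage used.
blocks, stored = hdfs_blocks(300 * 1024 * 1024)
print(blocks)                                # 3
print(stored == 3 * 300 * 1024 * 1024)       # True
```

Replicating whole blocks, rather than mirroring whole disks, is what lets HDFS survive the failure of individual low-cost machines.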
Apart from the above-mentioned two core components, Hadoop
framework also includes the following two modules:
Hadoop Common: These are Java libraries and utilities required
by other Hadoop modules.
Hadoop YARN: This is a framework for job scheduling and cluster
resource management.
8. HBASE
HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is an open-source project and is horizontally
scalable.
HBase's data model is similar to Google's Bigtable, designed
to provide quick random access to huge amounts of structured data.
It leverages the fault tolerance provided by the Hadoop File System
(HDFS).
It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
One can store data in HDFS either directly or through HBase.
Data consumers read/access the data in HDFS randomly using
HBase, which sits on top of the Hadoop File System and provides
read and write access.
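HBase's logical layout, a sparse sorted map of row key to column family to column qualifier to value, can be mimicked with nested dictionaries. A minimal sketch with invented row and column names, omitting the per-cell timestamps/versions that real HBase keeps:

```python
# Toy model of an HBase table: row key -> column family -> qualifier -> value.
# Row keys and column names below are made up for illustration.
table = {}

def put(row, family, qualifier, value):
    """Write one cell, creating the row and family on first use."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    """Random read by row key -- the access pattern HBase adds over raw HDFS."""
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#1001", "info", "name", "Alice")
put("user#1001", "info", "city", "Pune")
put("user#1002", "info", "name", "Bob")

print(get("user#1001", "info", "name"))  # Alice
print(sorted(table))  # real HBase keeps rows physically sorted by row key
```

Because rows are sorted by key, HBase can locate any single row quickly, which is exactly the random real-time access that scanning files in HDFS cannot provide.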
9. HIVE
Hive is a data warehouse infrastructure tool to process structured data
in Hadoop. It resides on top of Hadoop to summarize Big Data, and
makes querying and analyzing easy.
Hive was initially developed by Facebook; later the Apache Software
Foundation took it up and developed it further as an open-source
project under the name Apache Hive. It is used by many companies. For
example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
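Because Hive's query language (HiveQL) is SQL-like, the kind of query it runs can be illustrated with ordinary SQL. The snippet below is only an analogy using SQLite, with an invented table; Hive would compile a similar HiveQL statement into MapReduce jobs over data in HDFS instead of executing it locally:

```python
import sqlite3

# Stand-in for a Hive table of web logs; names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("a", "/home"), ("a", "/cart"), ("b", "/home")])

# A HiveQL-style summarization: count views per URL.
rows = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # [('/cart', 1), ('/home', 2)]
```

This is also why Hive fits batch analytics but not OLTP: each such query becomes a batch job over the whole dataset, not a quick row-level lookup or update.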
10. SQOOP
Sqoop: “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases such as
MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases. It is provided by the Apache Software Foundation.
How Does Sqoop Work?
Sqoop translates each import or export into a MapReduce job; the map
tasks connect to the database and transfer slices of the data in
parallel, which also makes the transfer fault-tolerant.
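As a rough sketch of what an import produces: by default, Sqoop writes each database row as one comma-delimited line in text files under the target HDFS directory. The rows and columns below are invented for illustration:

```python
import csv
import io

# Pretend result set fetched from a relational table (invented data).
db_rows = [(1, "Alice", "alice@example.com"),
           (2, "Bob", "bob@example.com")]

# Emulate one output part-file: one comma-delimited line per imported row.
buffer = io.StringIO()
writer = csv.writer(buffer)
for row in db_rows:
    writer.writerow(row)

hdfs_part_file = buffer.getvalue()
print(hdfs_part_file.splitlines())
```

An export runs the same idea in reverse: delimited lines in HDFS are parsed back into rows and inserted into the relational table.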
11. APPLICATIONS
WEB & E-TAILING
Recommendation Engines
Ad Targeting
Search Quality
Abuse and Click Fraud Detection
BANK & FINANCIAL SERVICES
Modeling True Risk
Threat Analysis
Fraud Detection
Credit Scoring and Analysis
etc…
12. ADVANTAGES & LIMITATIONS
Advantages
Hadoop lets us work on very large datasets.
We can process unstructured data and extract useful information from it.
It underpins everyday services such as online shopping and online money
transfers.
Social media platforms are among the best examples of Big Data in action.
Limitations of Hadoop
Hadoop can perform only batch processing, and data is accessed
only sequentially. That means the entire dataset must be scanned
even for the simplest of jobs.
13. PLATFORM USES
IBM INFOSPHERE BIGINSIGHTS
InfoSphere BigInsights is a software platform designed to help
firms discover and analyze business insights hidden in large volumes
of a diverse range of data: data that's often ignored or discarded
because it's too impractical or difficult to process using
traditional means. Examples of such data include log records, click
streams, social media data, news feeds, electronic sensor output,
and even some