BIG DATA
AND HADOOP
Submitted By -
Name - Ashish Rathore
Branch - B.Tech(CSE)
Year - 4th year
Submitted To-
Mr. Dushyant Kumar
Assistant Professor
VGU Jaipur
SUMMARY OF
CONTENTS
OUR MAIN
TOPICS TODAY
1. Data and Information
2. What is Big Data and its Types
3. Sources and Characteristics of Big Data
4. Importance of Big Data
5. Big Data Challenges
6. Tools to Manage Big Data
7. What is Hadoop and Hadoop as a Solution
8. Hadoop Eco-system
9. Three Major Components of Hadoop
10. Future in Big Data
DIFFERENCE
BETWEEN
DATA AND
INFORMATION
WITHOUT DATA YOU'RE JUST ANOTHER PERSON WITH AN OPINION
- W. EDWARDS DEMING
WHAT IS
BIG DATA?
WHY IS IT IMPORTANT TO US?
HISTORY OF BIG DATA

ORIGIN
The origins of large data sets go back to the 1960s and '70s, when the world of data was just getting started with the first data centers and the development of the relational database.

BEGINNING
Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. NoSQL also began to gain popularity during this time.

PRESENT
Users are still generating huge amounts of data, but it's not just humans who are doing it. With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance.
TYPES OF BIG DATA

STRUCTURED DATA
Structured data is organized in a formatted repository, typically a database. It covers any data that can be stored in a SQL database table with rows and columns. Example: relational data.

SEMI-STRUCTURED DATA
Semi-structured data does not reside in a relational database but has organizational properties that make it easier to analyze. Examples: XML, JSON, etc.

UNSTRUCTURED DATA
Unstructured data is not organized in a predefined manner and does not follow a predefined data model. Examples: Word, PDF, text, and media files.
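The three types can be seen side by side in a short sketch. This is only an illustration: the sample records and field names below are made up, and Python's standard library stands in for real big-data tooling.

```python
import csv
import io
import json

# Structured: fixed rows and columns, as in a SQL table.
structured = io.StringIO("id,name,age\n1,Asha,34\n2,Ravi,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no rigid schema, but organizational properties
# (nested keys, tags) that make the data parseable.
semi = json.loads('{"id": 1, "name": "Asha", "tags": ["prime", "mobile"]}')

# Unstructured: free text with no predefined model; processing it
# starts with techniques such as tokenization.
unstructured = "Shipment delayed again - third time this month!"
tokens = unstructured.lower().split()

print(rows[0]["name"])  # every structured record has the same columns
print(semi["tags"])     # semi-structured fields can vary record to record
print(len(tokens))      # raw text must be tokenized before analysis
```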
HOW BIG DATA WORKS

INTEGRATE
Big data brings together data from many disparate sources and applications. Traditional data integration mechanisms, such as ETL (extract, transform, and load), generally aren't up to the task.

MANAGE
Big data requires storage. Your storage solution can be in the cloud, on premises, or both. The cloud is gradually gaining popularity because it supports your current compute requirements and enables you to spin up resources as needed.

ANALYZE
Your investment in big data pays off when you analyze and act on your data. Explore the data further to make new discoveries. Build data models with machine learning and artificial intelligence. Put your data to work.
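The integrate step mentioned above is classically an ETL pipeline. A toy sketch, in which the source names and record shapes are invented for illustration:

```python
# Toy ETL (extract, transform, load) pipeline. The "crm" and "weblog"
# sources and their record shapes are made up for illustration.

def extract():
    # Pretend these rows come from two disparate sources.
    crm = [{"user": "asha", "spend": "120.50"}, {"user": "ravi", "spend": "80"}]
    weblog = [{"user": "ASHA", "clicks": 42}]
    return crm, weblog

def transform(crm, weblog):
    # Normalize keys and types so the two sources can be joined.
    spend = {r["user"].lower(): float(r["spend"]) for r in crm}
    clicks = {r["user"].lower(): r["clicks"] for r in weblog}
    return [{"user": u, "spend": spend[u], "clicks": clicks.get(u, 0)}
            for u in spend]

def load(rows, warehouse):
    # "Load" into an in-memory list; a real pipeline writes to a warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
print(warehouse)
```

The point of the transform step is visible in miniature: keys are lower-cased and strings become numbers, so records from different systems line up before loading.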
FACTS AND FIGURES
SOURCES OF BIG DATA
4 V'S OF BIG DATA: VOLUME, VELOCITY, VARIETY, VERACITY
WHY IS IT IMPORTANT TO US?

01. Better decision making

02. Product development
What the customers want, the solution to their problems, analyzing their needs according to market trends, etc. Companies like Netflix and Procter & Gamble use big data to anticipate customer demand.

03. Machine learning
We are now able to teach machines instead of programming them. The availability of big data to train machine learning models makes that possible.

04. Product price optimization
The goal is to set prices so that profit is maximized, pricing the product according to the customer's willingness to pay.

05. Recommendation engines
Recommendations based on your previous as well as current choices made on various online platforms.
BIG DATA CHALLENGES
- Capturing data
- Storage
- Curation
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
- Deploying and managing
TECHNOLOGIES AND TOOLS
TO HELP MANAGE BIG DATA
Apache Hadoop: a framework that allows parallel data processing and distributed data storage.

Apache Spark: a general-purpose distributed data processing framework.

Apache Kafka: a stream processing platform.

Apache Cassandra: a distributed NoSQL database management system.
WHAT IS HADOOP?
Hadoop is an open-source framework provided by Apache to process and analyze very large volumes of data.
It is written in Java and is used by Google, Facebook, LinkedIn, Yahoo, Twitter, and others.
HADOOP-AS-A-SOLUTION

STORING BIG DATA
Data is stored in blocks across the DataNodes, and you can specify the size of the blocks.

ACCESSING & PROCESSING THE DATA
Processing logic is sent to the various slave nodes, and the data is then processed in parallel across the different slave nodes.

STORING A VARIETY OF DATA
You can store all kinds of data, whether structured, semi-structured, or unstructured.
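The block-based storage described above can be sketched with simple arithmetic. This is a local simulation, not the HDFS API; 128 MB is the default block size in Hadoop 2.x and is configurable.

```python
# Sketch of how HDFS splits a file into fixed-size blocks.
# 128 MB is the Hadoop 2.x default block size; it is configurable.

def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size_bytes, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus a 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))             # 3
print(blocks[-1][1] // 2**20)  # 44
```

Note that the last block only occupies as much space as it needs, which is why HDFS tolerates files that are not exact multiples of the block size.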
WHERE IS HADOOP USED?
It is used for:
- Search: Yahoo, Amazon, Zvents
- Log processing: Facebook, Yahoo
- Data warehousing: Facebook, AOL
- Video and image analysis: New York Times, Eyealike
HADOOP ECO-SYSTEM COMPONENTS, PART 1

1. Hadoop HDFS (2007): a distributed file system for reliably storing huge amounts of data in the form of files.
2. Hadoop MapReduce (2007): a distributed algorithm framework for the parallel processing of large datasets on the HDFS filesystem.
3. Cassandra (2008): a key-value NoSQL database with column-family data representation and asynchronous masterless replication.
4. HBase (2008): a key-value NoSQL database with column-family data representation and master-slave replication.
5. Zookeeper (2008): a distributed coordination service for distributed applications, based on a variant of the Paxos algorithm called Zab.
6. Pig (2009): a scripting interface over MapReduce for developers who prefer scripting to native Java MapReduce programming.
HADOOP ECO-SYSTEM COMPONENTS, PART 2

7. Hive (2009): a SQL interface over MapReduce for developers and analysts who prefer SQL to native Java MapReduce programming.
8. Mahout (2009): a library of machine learning algorithms, implemented on top of MapReduce, for finding meaningful patterns in HDFS datasets.
9. YARN (2011): a system to schedule applications and services on an HDFS cluster and manage cluster resources such as memory and CPU.
10. Flume (2011): a tool to collect, aggregate, and reliably move and ingest large amounts of data into HDFS.
11. Spark (2012): provides libraries for machine learning, a SQL interface, and near-real-time stream processing.
12. Sqoop (2010): a tool to import data from an RDBMS or data warehouse into HDFS/HBase and export it back.
HADOOP HDFS
Data is stored in a distributed manner in HDFS. There are two components of HDFS: the NameNode and the DataNode. While there is only one NameNode, there can be multiple DataNodes.

Features of HDFS:
- Provides distributed storage
- Can be implemented on commodity hardware
- Provides data security
- Highly fault-tolerant: if one machine goes down, its data is served from another machine holding a replica
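The fault tolerance above comes from replication: HDFS keeps multiple copies of each block (the default replication factor is 3). A simplified sketch of placement and failover, with invented node names and a round-robin policy in place of HDFS's real rack-aware one:

```python
import itertools

# Simplified HDFS-style replication and failover. Node names and the
# round-robin placement policy are illustrative; real HDFS is rack-aware.

REPLICATION_FACTOR = 3  # the HDFS default
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def place_replicas(block_id, nodes, factor=REPLICATION_FACTOR):
    # Pick `factor` distinct nodes, walking a ring of DataNodes.
    start = block_id % len(nodes)
    ring = itertools.islice(itertools.cycle(nodes), start, start + factor)
    return list(ring)

def read_block(replicas, dead_nodes):
    # If one machine goes down, the read is served from another replica.
    for node in replicas:
        if node not in dead_nodes:
            return node
    raise IOError("all replicas lost")

replicas = place_replicas(block_id=0, nodes=datanodes)
print(replicas)                       # ['dn1', 'dn2', 'dn3']
print(read_block(replicas, {"dn1"}))  # 'dn2' - failover to the next replica
```

With a replication factor of 3, the cluster survives the loss of any two of a block's nodes, which is the guarantee the "highly fault-tolerant" bullet refers to.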
HADOOP MAPREDUCE

Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. MapReduce consists of two distinct tasks: Map and Reduce. As the name suggests, the reduce phase takes place after the map phase has completed. In the map job, a block of data is read and processed to produce key-value pairs as intermediate outputs. The reducer receives the key-value pairs from multiple map jobs and aggregates those intermediate tuples into a smaller set of key-value pairs, which is the final output.
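The map-then-reduce flow described above can be simulated locally with the classic word-count example. Real Hadoop jobs implement Mapper and Reducer classes in Java; this sketch only mirrors the data flow: map, shuffle (group by key), reduce.

```python
from collections import defaultdict

# Local simulation of the MapReduce word-count data flow.

def mapper(block):
    # Map: read a block of data, emit intermediate key-value pairs.
    for word in block.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key across all map outputs.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce: aggregate each key's values into one final pair.
    return (key, sum(values))

# Two "blocks" stand in for data blocks processed on different slave nodes.
blocks = ["big data big hadoop", "hadoop big"]
intermediate = [pair for block in blocks for pair in mapper(block)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Because each mapper only needs its own block and each reducer only needs one key's values, both phases parallelize naturally across nodes, which is exactly the scaling advantage noted above.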
HADOOP YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.

YARN acts like an operating system for Hadoop: it is a resource management layer that runs on top of HDFS. It manages cluster resources to make sure no single machine is overloaded, and it performs job scheduling to make sure that jobs are scheduled in the right place.
CAREER OPPORTUNITIES IN BIG DATA
DATABASE ADMINISTRATOR
DATABASE DEVELOPER
DATA ANALYST
DATA SCIENTIST
BIG DATA ENGINEER
DATA MODELER
ANY QUERIES ?
