BIG DATA
AND HADOOP
Submitted By -
Name - Ashish Rathore
Branch - B.Tech(CSE)
Year - 4th year
Submitted To-
Mr. Dushyant Kumar
Assistant Professor
VGU Jaipur
SUMMARY OF
CONTENTS
OUR MAIN
TOPICS TODAY
1. Data and Information
2. What is Big Data and its Types
3. Sources and Characteristics of Big Data
4. Importance of Big Data
5. Big Data Challenges
6. Tools to Manage Big Data
7. What is Hadoop and Hadoop as a Solution
8. Hadoop Eco-system
9. Three Major Components of Hadoop
10. Future in Big Data
DIFFERENCE
BETWEEN
DATA AND
INFORMATION
WITHOUT DATA YOU'RE JUST ANOTHER PERSON WITH AN OPINION
- W. EDWARDS DEMING
WHAT IS
BIG DATA?
WHY IS IT IMPORTANT TO US?
HISTORY OF BIG DATA

ORIGIN
The origins of large data sets go back to the 1960s and '70s, when the world of data was just getting started with the first data centers and the development of the relational database.

BEGINNING
Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. NoSQL also began to gain popularity during this time.

PRESENT
Users are still generating huge amounts of data, but it's not just humans who are doing it. With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance.
TYPES OF BIG DATA

STRUCTURED DATA
Structured data is organized in a formatted repository, typically a database. It covers any data that can be stored in a SQL database table with rows and columns. Example: relational data.

SEMI-STRUCTURED DATA
Semi-structured data does not reside in a relational database but has organizational properties that make it easier to analyze. Examples: XML, JSON, etc.

UNSTRUCTURED DATA
Unstructured data is not organized in a predefined manner and does not follow a predefined data model. Examples: Word, PDF, text, and media files.
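The three types can be seen side by side in a short sketch. This is only an illustration: the sample records and field names below are made up, and Python's standard library stands in for real big-data tooling.

```python
import csv
import io
import json

# Structured: fixed rows and columns, as in a SQL table.
structured = io.StringIO("id,name,age\n1,Asha,34\n2,Ravi,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no rigid schema, but organizational properties
# (nested keys, tags) that make the data parseable.
semi = json.loads('{"id": 1, "name": "Asha", "tags": ["prime", "mobile"]}')

# Unstructured: free text with no predefined model; processing it
# starts with techniques such as tokenization.
unstructured = "Shipment delayed again - third time this month!"
tokens = unstructured.lower().split()

print(rows[0]["name"])  # every structured record has the same columns
print(semi["tags"])     # semi-structured fields can vary record to record
print(len(tokens))      # raw text must be tokenized before analysis
```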
HOW BIG DATA WORKS

INTEGRATE
Big data brings together data from many disparate sources and applications. Traditional data integration mechanisms, such as ETL (extract, transform, and load), generally aren't up to the task.

MANAGE
Big data requires storage. Your storage solution can be in the cloud, on premises, or both. The cloud is gradually gaining popularity because it supports your current compute requirements and enables you to spin up resources as needed.

ANALYZE
Your investment in big data pays off when you analyze and act on your data. Explore the data further to make new discoveries. Build data models with machine learning and artificial intelligence. Put your data to work.
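The integrate step mentioned above is classically an ETL pipeline. A toy sketch, in which the source names and record shapes are invented for illustration:

```python
# Toy ETL (extract, transform, load) pipeline. The "crm" and "weblog"
# sources and their record shapes are made up for illustration.

def extract():
    # Pretend these rows come from two disparate sources.
    crm = [{"user": "asha", "spend": "120.50"}, {"user": "ravi", "spend": "80"}]
    weblog = [{"user": "ASHA", "clicks": 42}]
    return crm, weblog

def transform(crm, weblog):
    # Normalize keys and types so the two sources can be joined.
    spend = {r["user"].lower(): float(r["spend"]) for r in crm}
    clicks = {r["user"].lower(): r["clicks"] for r in weblog}
    return [{"user": u, "spend": spend[u], "clicks": clicks.get(u, 0)}
            for u in spend]

def load(rows, warehouse):
    # "Load" into an in-memory list; a real pipeline writes to a warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
print(warehouse)
```

The point of the transform step is visible in miniature: keys are lower-cased and strings become numbers, so records from different systems line up before loading.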
FACTS AND FIGURES
SOURCES OF BIG DATA
4 V'S OF BIG DATA: VOLUME, VELOCITY, VARIETY, VERACITY
WHY IS IT IMPORTANT TO US?

01. Better decision making

02. Product development
What the customers want, the solution to their problems, analyzing their needs according to market trends, etc. Companies like Netflix and Procter & Gamble use big data to anticipate customer demand.

03. Machine learning
We are now able to teach machines instead of programming them. The availability of big data to train machine learning models makes that possible.

04. Product price optimization
The goal is to set prices so that profit is maximized, pricing the product according to the customer's willingness to pay.

05. Recommendation engines
Recommendations based on your previous as well as current choices made on various online platforms.
BIG DATA CHALLENGES
- Capturing data
- Storage
- Curation
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
- Deploying and managing
TECHNOLOGIES AND TOOLS
TO HELP MANAGE BIG DATA
Apache Hadoop: a framework that allows parallel data processing and distributed data storage.

Apache Spark: a general-purpose distributed data processing framework.

Apache Kafka: a stream processing platform.

Apache Cassandra: a distributed NoSQL database management system.
WHAT IS HADOOP?
Hadoop is an open-source framework provided by Apache to process and analyze very large volumes of data.
It is written in Java and is used by Google, Facebook, LinkedIn, Yahoo, Twitter, and others.
HADOOP-AS-A-SOLUTION

STORING BIG DATA
Data is stored in blocks across the DataNodes, and you can specify the size of the blocks.

ACCESSING & PROCESSING THE DATA
Processing logic is sent to the various slave nodes, and the data is then processed in parallel across the different slave nodes.

STORING A VARIETY OF DATA
You can store all kinds of data, whether structured, semi-structured, or unstructured.
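The block-based storage described above can be sketched with simple arithmetic. This is a local simulation, not the HDFS API; 128 MB is the default block size in Hadoop 2.x and is configurable.

```python
# Sketch of how HDFS splits a file into fixed-size blocks.
# 128 MB is the Hadoop 2.x default block size; it is configurable.

def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size_bytes, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus a 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))             # 3
print(blocks[-1][1] // 2**20)  # 44
```

Note that the last block only occupies as much space as it needs, which is why HDFS tolerates files that are not exact multiples of the block size.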
WHERE IS HADOOP USED?
It is used for:
- Search: Yahoo, Amazon, Zvents
- Log processing: Facebook, Yahoo
- Data warehousing: Facebook, AOL
- Video and image analysis: New York Times, Eyealike
HADOOP ECO-SYSTEM COMPONENTS, PART 1

1. Hadoop HDFS (2007): a distributed file system for reliably storing huge amounts of data in the form of files.
2. Hadoop MapReduce (2007): a distributed algorithm framework for the parallel processing of large datasets on the HDFS filesystem.
3. Cassandra (2008): a key-value NoSQL database with column-family data representation and asynchronous masterless replication.
4. HBase (2008): a key-value NoSQL database with column-family data representation and master-slave replication.
5. Zookeeper (2008): a distributed coordination service for distributed applications, based on a variant of the Paxos algorithm called Zab.
6. Pig (2009): a scripting interface over MapReduce for developers who prefer scripting to native Java MapReduce programming.
HADOOP ECO-SYSTEM COMPONENTS, PART 2

7. Hive (2009): a SQL interface over MapReduce for developers and analysts who prefer SQL to native Java MapReduce programming.
8. Mahout (2009): a library of machine learning algorithms, implemented on top of MapReduce, for finding meaningful patterns in HDFS datasets.
9. YARN (2011): a system to schedule applications and services on an HDFS cluster and manage cluster resources such as memory and CPU.
10. Flume (2011): a tool to collect, aggregate, and reliably move and ingest large amounts of data into HDFS.
11. Spark (2012): provides libraries for machine learning, a SQL interface, and near-real-time stream processing.
12. Sqoop (2010): a tool to import data from an RDBMS or data warehouse into HDFS/HBase and export it back.
HADOOP HDFS
Data is stored in a distributed manner in HDFS. There are two components of HDFS: the NameNode and the DataNode. While there is only one NameNode, there can be multiple DataNodes.

Features of HDFS:
- Provides distributed storage
- Can be implemented on commodity hardware
- Provides data security
- Highly fault-tolerant: if one machine goes down, its data is served from another machine holding a replica
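The fault tolerance above comes from replication: HDFS keeps multiple copies of each block (the default replication factor is 3). A simplified sketch of placement and failover, with invented node names and a round-robin policy in place of HDFS's real rack-aware one:

```python
import itertools

# Simplified HDFS-style replication and failover. Node names and the
# round-robin placement policy are illustrative; real HDFS is rack-aware.

REPLICATION_FACTOR = 3  # the HDFS default
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def place_replicas(block_id, nodes, factor=REPLICATION_FACTOR):
    # Pick `factor` distinct nodes, walking a ring of DataNodes.
    start = block_id % len(nodes)
    ring = itertools.islice(itertools.cycle(nodes), start, start + factor)
    return list(ring)

def read_block(replicas, dead_nodes):
    # If one machine goes down, the read is served from another replica.
    for node in replicas:
        if node not in dead_nodes:
            return node
    raise IOError("all replicas lost")

replicas = place_replicas(block_id=0, nodes=datanodes)
print(replicas)                       # ['dn1', 'dn2', 'dn3']
print(read_block(replicas, {"dn1"}))  # 'dn2' - failover to the next replica
```

With a replication factor of 3, the cluster survives the loss of any two of a block's nodes, which is the guarantee the "highly fault-tolerant" bullet refers to.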
HADOOP MAPREDUCE

Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. MapReduce consists of two distinct tasks: Map and Reduce. As the name suggests, the reduce phase takes place after the map phase has completed. In the map job, a block of data is read and processed to produce key-value pairs as intermediate outputs. The reducer receives the key-value pairs from multiple map jobs and aggregates those intermediate tuples into a smaller set of key-value pairs, which is the final output.
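The map-then-reduce flow described above can be simulated locally with the classic word-count example. Real Hadoop jobs implement Mapper and Reducer classes in Java; this sketch only mirrors the data flow: map, shuffle (group by key), reduce.

```python
from collections import defaultdict

# Local simulation of the MapReduce word-count data flow.

def mapper(block):
    # Map: read a block of data, emit intermediate key-value pairs.
    for word in block.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key across all map outputs.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce: aggregate each key's values into one final pair.
    return (key, sum(values))

# Two "blocks" stand in for data blocks processed on different slave nodes.
blocks = ["big data big hadoop", "hadoop big"]
intermediate = [pair for block in blocks for pair in mapper(block)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Because each mapper only needs its own block and each reducer only needs one key's values, both phases parallelize naturally across nodes, which is exactly the scaling advantage noted above.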
HADOOP YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.

YARN acts like an operating system for Hadoop: it is a resource management layer that runs on top of HDFS. It manages cluster resources to make sure no single machine is overloaded, and it performs job scheduling to make sure that jobs are scheduled in the right place.
CAREER OPPORTUNITIES IN BIG DATA
DATABASE ADMINISTRATOR
DATABASE DEVELOPER
DATA ANALYST
DATA SCIENTIST
BIG DATA ENGINEER
DATA MODELER
ANY QUERIES ?
