Big Data Analysis Using
Hadoop Cluster
By:
Syed Furqan Haider Shah #176
Introduction
What is BIG DATA?
The term Big data is used to describe a massive volume
of both structured and unstructured data that is so large
that it's difficult to process using traditional database
and software techniques.
BIG DATA (contd.)
• Big data consists of a heterogeneous mixture of structured and
unstructured data.
• Big data refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, process
and analyze.
Challenges
• These statistical records keep on increasing, and they increase
very fast.
• Unfortunately, as the data grows it becomes a tedious task
to process such a large data set and extract meaningful
information.
• If the generated data comes in various formats, its processing
poses new challenges.
Challenges (contd.)
• An issue with big data is that it typically uses NoSQL stores and
has no Data Description Language.
• Also, web-scale data is heterogeneous rather than uniform. For
analysis of big data, database integration and cleaning are much
harder than with traditional mining approaches.
Solution
• Parallel programming
• An efficient computing platform will not have centralized data
storage; instead, the platform will use large-scale distributed
storage.
• Restricting access to the data
HADOOP
HADOOP
Hadoop is basically a tool that operates on a distributed
file system. In this architecture, all the DataNodes work
in parallel, but each individual DataNode still operates
in a sequential fashion.
HADOOP Architecture
• It is an Apache Software Foundation project and an open-source
software platform for scalable, distributed computing.
• The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across
clusters of computers using simple programming models.
HADOOP Architecture (contd.)
• Hadoop provides fast and reliable analysis of both
structured and unstructured data.
• It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
• Hadoop uses the MapReduce programming model to mine
data.
• A MapReduce program splits the input dataset into independent
subsets, which are processed in parallel by map tasks.
• The Map() procedure performs filtering and sorting.
• The Reduce() procedure performs a summary operation.
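As an illustration (not taken from the slides), a minimal Java sketch of the MapReduce model using the classic word count: the Map() step emits (word, 1) pairs and the Reduce() step sums them per word. The job driver and cluster configuration are omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map(): turns each input line into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);           // emit (word, 1)
            }
        }
    }

    // Reduce(): summarizes all counts for one word into a single total.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            total.set(sum);
            context.write(word, total);             // emit (word, total count)
        }
    }
}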
METHODOLOGY
Methodology
Hadoop’s library is designed to deliver a highly available service on
top of a cluster of computers. A Hadoop cluster as a whole can be seen
as consisting of:
1. Core Hadoop
2. Hadoop Ecosystem
Relationship between Core Hadoop and the
Hadoop Ecosystem
Core Hadoop consists of:
• HDFS
• MapReduce.
Since the commencement of the project, a lot of other software
projects have grown up around it. This collection is called the Hadoop Ecosystem.
HDFS (Hadoop Distributed File System)
• An HDFS instance may consist of a large number of server machines,
each storing a part of the file system data.
• Detection of faults and quick automatic recovery from them is a core
architectural objective of HDFS.
• Applications that run on HDFS need streaming access to their datasets.
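As an illustrative sketch (not from the slides), a small Java program that streams a file out of HDFS using the Hadoop FileSystem API; the cluster settings come from the configuration files on the classpath, and the input path is a hypothetical command-line argument.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured NameNode
        Path file = new Path(args[0]);              // e.g. /data/sales.csv (hypothetical)
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {   // stream the file line by line
                System.out.println(line);
            }
        }
    }
}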
MapReduce
MapReduce is the basic logical flow of task execution. It consists
mainly of Mappers and Reducers.
Mappers:
Mappers do the job of extracting the required raw information from
the whole dataset, e.g. in a sales dataset a Mapper extracts the date of
sale, product name, selling price, and cost price of the various products.
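A minimal sketch of such a Mapper in Java (the CSV record layout date,product,sellingPrice,costPrice, the class name, and the emitted (product, profit) pairs are assumptions made for illustration):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (product name, profit on one sale) for every input record.
// Assumed record layout: date,product,sellingPrice,costPrice
public class SalesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Text product = new Text();
    private final DoubleWritable profit = new DoubleWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 4) {
            return;                                  // skip malformed records
        }
        product.set(fields[1].trim());
        double sellingPrice = Double.parseDouble(fields[2].trim());
        double costPrice = Double.parseDouble(fields[3].trim());
        profit.set(sellingPrice - costPrice);
        context.write(product, profit);              // emit (product, per-sale profit)
    }
}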
MapReduce (contd.)
• Reducers:
The Mapper output is then sorted according to its key values and
passed to the Reducers. The Reducers do the actual processing on this
intermediate data provided by the Mappers and accomplish the final
task, yielding the desired output.
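Continuing the hypothetical sales example above, a matching Reducer sketch that sums the per-sale profits emitted by SalesMapper for each product (class and variable names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the per-sale profits for each product key.
public class SalesReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private final DoubleWritable total = new DoubleWritable();

    @Override
    protected void reduce(Text product, Iterable<DoubleWritable> profits, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable p : profits) {
            sum += p.get();
        }
        total.set(sum);
        context.write(product, total);   // emit (product, total profit)
    }
}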