Big Data
PREPARED BY:
AHMED ALTAYEB SHEIKH EDREES
FAWAZ AWAD YAHIA ABDELGADIR
Outlines
1. INTRODUCTION
2. WHAT IS BIG DATA
3. BIG DATA GENERATORS
4. CHARACTERISTICS OF BIG DATA
5. BENEFITS OF BIG DATA
6. HADOOP
 HDFS
 MapReduce
7. BI VS BIG DATA
Introduction
What is Big Data
Big data refers to very large data sets that may be analyzed computationally to reveal patterns,
trends, and associations, especially relating to customer behavior and
interactions.
Big data in general is defined as "high volume, velocity and variety information
assets that demand cost-effective, innovative forms of information processing
for enhanced insight and decision making."
As a technology term, it describes data that has grown too large to be managed
by the methods previously known to work.
Big Data generators
This data comes from everywhere:
sensors used to gather climate information,
posts to social media sites,
digital pictures,
online shopping,
airlines,
purchase transaction records, and many more…
All of this data is "big data."
Characteristics
"Big data is data characterized by three attributes: volume, velocity and variety."
Volume
The size of the data determines the value and potential of the data under
consideration. The name 'Big Data' itself contains a term related to size,
hence this characteristic.
Variety
Data today comes in all types of formats: structured, numeric data in traditional
databases; unstructured text documents, email, stock ticker data and financial
transactions; and semi-structured data as well.
Velocity
The speed at which data is generated and processed to meet the demands and
challenges that lie ahead in the path of growth and development. For example:
Facebook generates 100 TB of data daily
Twitter generates 8 TB of data daily
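To put these rates in perspective (a rough back-of-the-envelope calculation based on the 100 TB/day figure above): 100 × 10^12 bytes ÷ 86,400 seconds ≈ 1.2 GB of new data arriving every second.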
Benefits of Big Data
Cost Reduction from Big Data Technologies
Time Reduction from Big Data
Developing New Big Data-Based Offerings
Supporting Internal Business Decisions
Real-time big data isn't just about storing petabytes or exabytes of data in
a data warehouse; it's about the ability to make better decisions and take
meaningful actions at the right time.
What is Hadoop
A flexible, highly available architecture for large-scale computation and data
processing on a network of commodity hardware
A framework that allows for distributed processing of large data sets across clusters
of commodity servers
– Store large amounts of data
– Process the large amounts of data stored
Why Hadoop?
Hadoop is an open-source, highly reliable, distributed data processing platform that:
Handles large amounts of data
Stores data in native format
Delivers linear scalability at low cost
Resilient in case of infrastructure failures
Transparent application scalability
HDFS (Hadoop Distributed File System)
HDFS enables Hadoop to store huge files. It’s a scalable file system
that distributes and stores data across all machines in a Hadoop cluster.
Scale-Out Architecture - Add servers to increase capacity
High Availability - Serve mission-critical workflows and applications
Fault Tolerance - Automatically and seamlessly recover from failures
Load Balancing - Place data intelligently for maximum efficiency and
utilization
Tunable Replication - Multiple copies of each file provide data protection and
computational performance
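As a rough illustration of how an application interacts with HDFS, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:8020) and the file path are assumptions for illustration, not details from the slides; the sketch writes a small file into the cluster and then requests three replicas of it (tunable replication).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks and
        // spreads them across the DataNodes of the cluster.
        Path path = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }

        // Tunable replication: ask HDFS to keep three copies of the file.
        fs.setReplication(path, (short) 3);
        fs.close();
    }
}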
Namenode and datanode
DataNode - A piece of software that runs on each slave node of the cluster; slave
nodes make up the majority of the machines in a cluster. The NameNode places data
blocks onto these DataNodes.
NameNode - Runs on a master node and tracks and directs the storage of the cluster.
It records which blocks make up each file (such as the 150 MB file in the example
below) and which DataNodes hold them. The information stored here is called metadata.
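A minimal worked example, assuming HDFS's defaults of a 128 MB block size and a replication factor of 3 (values not stated in the slides): a 150 MB file is split into two blocks of 128 MB and 22 MB. Each block is copied to three different DataNodes, so the cluster stores six block replicas in total, while the NameNode holds only the metadata mapping the file to its blocks and their locations.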
MapReduce
MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster.
Scale-out Architecture - Add servers to increase processing power
Security & Authentication - Works with HDFS security to make sure that only
approved users can operate against the data in the system
Resource Manager - Employs data locality and server resources to determine optimal
computing operations
Optimized Scheduling - Completes jobs according to prioritization
Flexibility - Procedures can be written in virtually any programming language
Resiliency & High Availability - Multiple job and task trackers ensure that jobs fail
independently and restart automatically
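To make the programming model concrete, here is a minimal sketch of the classic word-count job using Hadoop's Java MapReduce API (the standard introductory example, not code from these slides). The map phase emits a (word, 1) pair for every word in its input split; the reduce phase sums the counts for each word, and the same reducer doubles as a combiner to cut shuffle traffic.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reduce: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner reduces shuffle volume
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the job is packaged as wordcount.jar, it would be launched with something like: hadoop jar wordcount.jar WordCount /input /output (paths are placeholders).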
BI vs. Big Data