HADOOP (A BIG DATA INITIATIVE)
-Mansi Mehra
AGENDA
 What is the problem?
 What is the solution?
 HDFS
 MapReduce
 HBase
 PIG
 HIVE
 Zookeeper
 Spark
DEFINING THE PROBLEM – 3V
 Volume - Lots and lots of data
 Datasets are so large and complex that they cannot be handled by a traditional relational database
 Challenges: capture, curation, storage, search, sharing, transfer, analysis and visualization.
DEFINING THE PROBLEM – 3V (CONTD.)
 Velocity - Huge amounts of data generated at
incredible speed
 NYSE generates about 1 TB of new trade data per day
 AT&T's anonymized Call Detail Records (CDRs) top out at around 1 GB per hour.
 Variety - Differently formatted data sets from
different sources
 Twitter keeps track of tweets, Facebook produces posts and likes data, YouTube streams videos.
WHY TRADITIONAL STORAGES DON’T WORK
 Unstructured data is exploding; little of the data produced today is relational in nature.
 No redundancy
 High computational cost
 Capacity limit for structured data (costly hardware)
 Expensive licenses
Data type                                     Nature
XML                                           Semi-structured
Word docs, PDF files, etc.                    Unstructured
Email body                                    Unstructured
Data from Enterprise Systems (ERP, CRM etc.)  Structured
HOW DOES HADOOP WORK?
HDFS (HADOOP 1.0 VS 2.0)
HDFS (HADOOP 2.0)
YARN (2.0) - YET ANOTHER RESOURCE NEGOTIATOR
 Cluster resource management and job scheduling framework for Hadoop 2.0.
 YARN has a Resource Manager that:
 Manages and allocates cluster resources
 Improves performance and Quality of Service
MAP REDUCE
 Programming model in Java
 Works on large amounts of data
 Provides redundancy & fault tolerance
 Runs the code on each data node
MAP REDUCE (CONTD.)
 Steps for Map Reduce:
 Read in lots of data
 Map: extract something you care about from each
record/line.
 Shuffle and sort
 Reduce: aggregate, summarize, filter or transform
 Write results.
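
To make these steps concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API (the classic example; input/output paths come from the command line, and class names are illustrative, not from the slides):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: extract each word from a line and emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: aggregate the counts for each word after shuffle and sort
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);   // optional local aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // read in lots of data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // write results
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map phase emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reduce phase sums the counts; the code runs on the nodes where the data blocks live.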
HDFS (HADOOP 2.0)
HIVE (OVERVIEW)
 A data warehouse infrastructure built on top of Hadoop
 Compiles SQL-like queries (HiveQL) into MapReduce jobs and runs them on the cluster.
 Brings structure to unstructured data
 Key Building Principles:
 Structured data with rich data types (structs, lists and maps)
 Directly query data from different formats (text/binary) and file
formats (Flat/sequence).
 SQL as a familiar programming tool and for standard analytics
 Types of applications:
 Summarization: Daily/weekly aggregations
 Ad hoc analysis
 Data Mining
 Spam detection
 Many more ….
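
As an illustration of "SQL as a familiar programming tool", here is a minimal sketch that submits a HiveQL query through the HiveServer2 JDBC driver; the endpoint, credentials and the page_views table are placeholders, not taken from the slides:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDailySummary {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint; host, port and database are placeholders
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // Hive compiles this query into one or more MapReduce jobs on the cluster
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS views " +
          "FROM page_views WHERE view_date = '2015-01-01' " +
          "GROUP BY page ORDER BY views DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
      }
    }
  }
}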
PIG (OVERVIEW)
 High level dataflow language
 Has its own syntax, Pig Latin (preferable for people with a programming background)
 Compiler that produces sequences of MapReduce
programs.
 Its structure is amenable to substantial parallelization.
 Key properties of PIG:
 Ease of programming: Trivial to achieve parallel execution of
simple and parallel data analysis tasks
 Optimization opportunities: allows user to focus on semantics
rather than efficiency.
 Extensibility: Users can create their own functions to do
special purpose processing.
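
A minimal sketch of the dataflow style, assuming the Java PigServer API to drive Pig on the cluster; the Pig Latin below (input/output paths, field names) is illustrative only:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TopPagesPig {
  public static void main(String[] args) throws Exception {
    // Run the Pig Latin dataflow as MapReduce jobs; paths and fields are placeholders
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("views = LOAD '/data/page_views' USING PigStorage('\\t') " +
                      "AS (user:chararray, page:chararray);");
    pig.registerQuery("grouped = GROUP views BY page;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group AS page, COUNT(views) AS n;");
    pig.registerQuery("ranked = ORDER counts BY n DESC;");
    // Pig's compiler turns the dataflow above into a sequence of MapReduce programs
    pig.store("ranked", "/data/top_pages");
  }
}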
HBASE (OVERVIEW)
 HBase is a distributed column-oriented data store built
on top of HDFS.
 Data is logically organized into tables, rows and
columns.
 HDFS is good for batch processing (scan over big files).
 Not good for record lookup.
 Not good for incremental addition of small batches.
 Not good for updates.
 HBase is designed to efficiently address the above
points
 Fast record lookup
 Support for record level insertion
 Support for updates (not in place).
 Updates are done by creating new versions of values.
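
A minimal sketch of record-level access, assuming the HBase 1.x Java client API; the users table, info column family and row key are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRecordAccess {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table users = conn.getTable(TableName.valueOf("users"))) {

      // Record-level insert: writes a new cell version rather than updating in place
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("user42@example.com"));
      users.put(put);

      // Fast single-record lookup by row key
      Result row = users.get(new Get(Bytes.toBytes("user-42")));
      byte[] email = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}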
ZOOKEEPER (OVERVIEW)
 ZooKeeper is a distributed, open-source
coordination service for distributed applications.
 Exposes a simple set of primitives that distributed
applications can build upon to implement higher
level services for synchronization, configuration
maintenance, and groups and naming.
 Coordination services are notoriously hard to get
right. They are prone to errors like race conditions
and deadlock.
 The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
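
A minimal sketch of using those primitives to share a small piece of configuration, assuming the standard ZooKeeper Java client; the ensemble address, znode path and value are placeholders:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Connect to the ZooKeeper ensemble (address is a placeholder)
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();  // wait until the session is established

    // Primitive: a znode holding a small piece of shared configuration
    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "jdbc:mysql://db:3306/app".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any process in the cluster can read (and watch) the same value
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));
    zk.close();
  }
}

Higher-level services such as leader election, locks and group membership are built from the same create/exists/getData/watch primitives.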
SPARK (OVERVIEW)
 Motivation: the MapReduce programming model transforms data flowing from stable storage to stable storage (disk to disk).
 Acyclic data flow is a powerful abstraction, but not efficient for
applications that repeatedly reuse a working set of data.
 Iterative algorithms
 Interactive data mining
 Spark makes working sets a first-class concept to efficiently
support these applications.
 Goal:
 To provide distributed memory abstractions for clusters to support
apps with working sets.
 Retain the attractive properties of MapReduce.
 Fault tolerance
 Data locality
 Scalability
 Augment data flow model with “resilient distributed datasets”
(RDDs)
SPARK (OVERVIEW CONTD.)
 Resilient distributed datasets (RDDs)
 Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost.
 Created by transforming data in stable storage using
data flow operators (map, filter, group-by, ..)
 Can be cached across parallel operations.
 Parallel operations on RDDs.
 Reduce, collect, count, save, …..
 Restricted shared variables
 Accumulators, broadcast variables.
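
A minimal sketch of this RDD workflow in the Java API, following the classic log-mining pattern; the HDFS path and filter strings are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogMining {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("log-mining").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // RDD created by transforming data in stable storage (path is a placeholder)
    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");

    // Transformation: filter builds a new immutable RDD; nothing runs yet
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

    // Cache the working set in memory so repeated queries reuse it
    errors.cache();

    // Actions (parallel operations) trigger execution on the cluster
    long total = errors.count();
    long timeouts = errors.filter(line -> line.contains("timeout")).count();
    System.out.println("errors=" + total + ", timeouts=" + timeouts);

    sc.stop();
  }
}

If a cached partition is lost, Spark rebuilds it by replaying the recorded transformations (the RDD's lineage) rather than replicating the data.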
SPARK (OVERVIEW CONTD.)
 Fast MapReduce-like engine
 Uses in-memory cluster computing
 Compatible with the Hadoop storage API.
 Has APIs in Scala, Java and Python.
 Useful for large datasets and iterative algorithms.
 Up to 40x faster than MapReduce.
 Support for:
 Spark SQL: Hive on Spark
 MLlib: machine learning library
 GraphX: graph processing.
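
As a taste of Spark SQL, here is a minimal sketch using the later SparkSession/DataFrame API (newer than the version described in the slides); the JSON path and the name/age schema are assumed:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("spark-sql-example")
        .master("local[*]")
        .getOrCreate();

    // Load a JSON dataset (path is a placeholder) and expose it to SQL
    Dataset<Row> people = spark.read().json("hdfs:///data/people.json");
    people.createOrReplaceTempView("people");

    // Standard SQL over distributed, in-memory data
    Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
    adults.show();

    spark.stop();
  }
}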

Editor's Notes

  • #7 Founder: Doug Cutting. He named it after his son’s toy elephant
  • #8 (same note for #9 and #13) Active NameNode: To provide HDFS high availability, the architecture now has an active and a standby NameNode. The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data itself, just reference metadata. Client applications talk to the NameNode whenever they wish to locate a file or add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives. It is essential to look after the NameNode; some recommendations from production use:
    - Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
    - Use ECC RAM.
    - On Java 6u15 or later, run the server VM with compressed pointers (-XX:+UseCompressedOops) to cut the JVM heap size down.
    - List more than one NameNode directory in the configuration, so that multiple copies of the file system metadata are stored. As long as the directories are on separate disks, a single disk failure will not corrupt the metadata.
    - Configure the NameNode to store one set of transaction logs on a separate disk from the image, and another set on a network-mounted disk.
    - Monitor the disk space available to the NameNode; if free space is getting low, add more storage.
    - Do not host DataNode, JobTracker or TaskTracker services on the same system.
    DataNodes: A DataNode stores data in HDFS. A functional file system has more than one DataNode, with data replicated across them. On startup, the DataNode connects to the NameNode, spinning until that service comes up, and then responds to requests from the NameNode for file system operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data. DataNode instances can talk to each other, which is what they do when they are replicating data. There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers rather than multiple disks on the same server. An ideal configuration is for a server to have a DataNode, a TaskTracker, and physical disks with one TaskTracker slot per CPU; this allows every TaskTracker 100% of a CPU and separate disks to read and write data. Avoid using NFS for data storage in a production system.
    Node Manager: The NM is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the Resource Manager (RM), overseeing containers' life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, log management, and auxiliary services which may be exploited by different YARN applications.
    Resource Manager: The RM is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NMs and the per-application ApplicationMasters (AMs).
    Application Masters: Responsible for negotiating resources with the ResourceManager and for working with the Node Managers to start the containers.
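    The client/NameNode/DataNode interaction described above is hidden behind the HDFS FileSystem API; here is a minimal read/write sketch in Java, with the NameNode URI and file path as placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client asks the NameNode where blocks live, then streams to/from DataNodes
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {   // write: blocks are replicated across DataNodes
          out.writeBytes("hello, HDFS\n");
        }
        try (FSDataInputStream in = fs.open(file)) {             // read: streamed directly from DataNodes
          IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
      }
    }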