1. Big Data
A brief introduction to Big Data
&
Hadoop
01/01/16 F. v. Noort
2. Big Data – A definition
• Big data usually includes data sets (both
structured and unstructured) with sizes beyond
the ability of commonly used software tools to
capture, curate, manage, and process data within
a tolerable elapsed time.
• Doug Laney (2001) 3V’s: “data growth challenges
and opportunities defined as being three-
dimensional, i.e. increasing Volume (amount of
data), Velocity (speed of data in and out), and
Variety (range of data types and sources)”
Big Data – A brief introduction, 01/01/16, F. v. Noort
3. Big Data – A definition
• Gartner (2012): "Big data is high volume, high
velocity, and/or high variety information
assets that require new forms of processing to
enable enhanced decision making, insight
discovery and process optimization."
4. Big Data - Characterization
The original 3V’s have been expanded by the following more
complete set of characteristics:
• Volume: the quantity of generated & stored data
• Velocity: the speed at which data is generated & processed
• Variety: the type and nature of the data
• Variability: Inconsistency of the data set can hamper
processes to handle and manage it
• Veracity: The quality of captured data can vary greatly,
affecting accurate analysis.
• Complexity: Managing data coming from multiple sources
can be very challenging. Data must be linked, connected,
and correlated so users can query and process it effectively.
5. Difference: Big Data versus BI
• Business Intelligence uses descriptive statistics
with high information density data to measure
things, detect trends, etc.
• Big data (analytics) uses inductive statistics and
concepts from nonlinear system identification to
infer laws (regressions, nonlinear relationships,
and causal effects) from large sets of data with
low information density to reveal relationships
and dependencies, or to perform predictions of
outcomes and behaviors
6. Architecture: Client Server
[Diagram: one central Server with many Clients connected to it]
Clients can always overwhelm the system!
7. Architecture: Storage Area Network
[Diagram: many Clients connect through a Central Contact Point to multiple Servers]
Clients can always overwhelm the system!
8. Architecture: Google File System (GFS)
• Instead of having a giant file storage appliance sitting in the back end, use industry-standard hardware on a large scale
• Drive high performance through the sheer number of components
• Reliability through redundancy & replication
• Computation work is done where the data is
[Diagram: a large grid of identical commodity nodes, each combining Storage and Compute]
9. Hadoop
• Based on work from Google File System + MapReduce
• Doug Cutting & Mike Cafarella created their own
version: Hadoop (named after Doug's son's toy elephant)
• Current distributions based on Open Source and
Vendor Work
– Apache Hadoop
– Cloudera: CDH4 w/ Impala
– Hortonworks
– MapR
– AWS
– Windows Azure HDInsight
10. Why use Hadoop?
• Scalability: Scales to Petabytes or more
• Fault tolerant
• Faster: Parallel data processing
• Better: suited for particular types of Big Data problems
• Open source
• Low cost: can be deployed on commodity
hardware
11. Hadoop Core Architecture
Hadoop core comprises:
• A distributed file system
HDFS: Hadoop Distributed File System (based on GFS)
File sharing & data protection across physical servers
• A processing paradigm
MapReduce
Distributed computing across physical servers
[Diagram: MapReduce on top of HDFS]
12. HDFS (1/2)
Hadoop Distributed File System
• Written in Java
• On top of the native file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file; large files are split into blocks
• Built on x86 standards
Lots of flexibility: reference architectures for many types of servers
[Diagram: the Hadoop Distributed File System spanning multiple x86 servers]
13. HDFS (2/2)
• HDFS File Blocks
– 64 MB (default), 128 MB (recommended)
– One HDFS block is backed by multiple operating system (OS) blocks
• Blocks are replicated (default 3x) to multiple nodes
• Allows for node failure without data loss
• Two key services
– Master NameNode
– Many DataNodes
• Checkpoint Node (Secondary NameNode)
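The block splitting and replication described above can be illustrated with a minimal sketch (plain Python, not Hadoop's actual implementation; the round-robin placement and DataNode names are illustrative, and real HDFS placement is rack-aware and decided by the NameNode):

```python
# Illustrative sketch: split a file into HDFS-style blocks and assign
# each block to 3 DataNodes (default replication factor 3).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the recommended block size
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Simplified round-robin placement of each block's replicas."""
    placement = {}
    for index, _ in blocks:
        placement[index] = [datanodes[(index + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)        # a 300 MB file
nodes = [f"DataNode{n}" for n in range(1, 7)]
print(len(blocks))                                   # 3 blocks: 128 MB + 128 MB + 44 MB
print(place_replicas(blocks, nodes)[0])
```

Losing any single node still leaves two replicas of each of its blocks, which is what makes node failure survivable.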
14. MapReduce
“take your task which is data oriented, chunk it up and distribute it on
the network such that every piece of work is done within the network
by the machine that has the piece of data that needs to be worked on”
MapReduce
• Processing paradigm that pairs with HDFS
• Distributed computation algorithm that pushes the compute down to each of the x86 servers
• Fault tolerant
• Parallelized (scalable) processing
• Combination of a Map- and a Reduce procedure:
– Map procedure: performs filtering and sorting of the data
– Reduce procedure: performs summary operations
15. Other Hadoop tools/frameworks
• Data Access:
– Hive, Pig, Mahout
• Tools
– Sqoop, Flume
16. Hadoop Architecture
Main nodes of Hadoop
• Hadoop Distributed File System (HDFS) nodes
– NameNode
– DataNode
• MapReduce nodes
– JobTracker
– TaskTracker
17. HDFS - NameNode
• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file-to-block-to-location mappings in the
namespace (manages the file system
namespace and metadata)
• Don’t use inexpensive commodity hardware
for this node
• Large memory requirements, keeps the entire
file system metadata in memory
18. HDFS - DataNode
• Many per Hadoop cluster
• Stores blocks on local disk
• Manages blocks with data and serves them to
clients
• Checksums on blocks => fault tolerant data store
system
• Clients connect to DataNode for I/O
• Sends frequent heartbeats ("hey, I'm alive"
pings, by default every 3 seconds) to the NameNode
• Sends block reports to NameNode
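The checksum idea can be sketched as follows (plain Python with a CRC from the standard library; real HDFS keeps per-chunk checksums in separate metadata files, which this sketch glosses over):

```python
import zlib

# Illustrative sketch: a DataNode stores a checksum next to each block
# and verifies it on read, so corrupted replicas can be detected and
# the client redirected to a healthy copy.
def store_block(data: bytes):
    """Store a block together with its CRC32 checksum."""
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(stored):
    """Verify the checksum before serving the block to a client."""
    if zlib.crc32(stored["data"]) != stored["checksum"]:
        raise IOError("block corrupted: fetch a replica from another DataNode")
    return stored["data"]

block = store_block(b"some block contents")
assert read_block(block) == b"some block contents"

block["data"] = b"bit-flipped contents"   # simulate on-disk corruption
try:
    read_block(block)
except IOError:
    print("corruption detected")
```

Combined with replication, a failed checksum is not fatal: the bad replica is dropped and re-replicated from one of the other copies.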
19. HDFS Write operation
[Diagram: an HDFS Client, the NameNode, and three racks of DataNodes (DataNode 1–18)]
• Client divides the file into blocks (Block 1, Block 2, Block 3)
• Client contacts the NameNode to write data
• NameNode replies: write it to these nodes (DN1, DN7, DN15)
• DataNodes replicate the data blocks, orchestrated by the NameNode
• Default: 3 replicas
• Rack-aware system!
20. HDFS Read operation
[Diagram: the HDFS Client, the NameNode, and the three racks of DataNodes holding the replicated blocks]
• Client contacts the NameNode to read data
• NameNode replies: find the blocks on these nodes
21. HDFS 2.0 Features
• NameNode High-Availability
– Two redundant NameNodes in active/passive
configuration
– Manual or automated failover
• NameNode Federation
– Multiple independent NameNodes using the same
collection of DataNodes
22. Hadoop MapReduce
• Moves the code to the data
• JobTracker
– Master service to monitor jobs
• TaskTracker
– Multiple services to run tasks
– Same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts
23. MapReduce JobTracker
• One per Hadoop cluster
• Receives job requests submitted by Client
• Schedules jobs in FIFO order
• Schedules & monitors MapReduce jobs on
task trackers
• Issues task attempts to TaskTrackers
• Single point of failure for MapReduce
24. MapReduce TaskTracker
• Runs on same node as DataNode service
• Many per Hadoop cluster
• Sends heartbeats and task reports to
JobTracker
• Executes MapReduce operations
• Configurable number of map and reduce slots
• Runs map and reduce task attempts
25. HDFS Architecture: Master & Slave
[Diagram: an HDFS Client talks to the master services (JobTracker, NameNode, Secondary NameNode); each slave node runs a DataNode plus a TaskTracker; MapReduce provides distributed data processing across the slaves]
Note
• Hadoop 1.0 has only 1 NameNode
• Hadoop 2.0 has an active & passive NameNode
26. How MapReduce works (1/3)
• MapReduce is a combination of a Map- and a
Reduce procedure:
– Map procedure: performs filtering and sorting of
the data
– Reduce procedure: performs summary operations
27. How MapReduce works (2/3)
Scenario: get the sum of sales grouped by ZipCode, using 2 map jobs (Mapper 1 and Mapper 2).

Input (CustId, ZipCode, Amount):
4 6654FD €75
7 1534CD €60
2 5734CD €30
1 1184AN €15
5 5734CD €65
0 6654FD €22
5 5734CD €15
6 4484AN €10
3 1534CD €95
8 4484AN €55
6 4484AN €25
9 1184AN €15

Map phase output (ZipCode, Amount):
6654FD €75
1534CD €60
5734CD €30
1184AN €15
5734CD €65
6654FD €22
5734CD €15
4484AN €10
1534CD €95
4484AN €55
4484AN €25
1184AN €15
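The map output above still has to be shuffled and reduced. A minimal plain-Python sketch of the whole scenario follows; it models the idea only, not Hadoop's actual API, and the half-and-half split between the two mappers is an illustrative assumption:

```python
from collections import defaultdict

# Illustrative MapReduce sketch for "sum sales grouped by ZipCode".
records = [
    (4, "6654FD", 75), (7, "1534CD", 60), (2, "5734CD", 30),
    (1, "1184AN", 15), (5, "5734CD", 65), (0, "6654FD", 22),
    (5, "5734CD", 15), (6, "4484AN", 10), (3, "1534CD", 95),
    (8, "4484AN", 55), (6, "4484AN", 25), (9, "1184AN", 15),
]

def map_phase(split):
    """Map: emit (ZipCode, Amount) pairs, dropping CustId."""
    return [(zipcode, amount) for _, zipcode, amount in split]

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the amounts for each ZipCode."""
    return {key: sum(values) for key, values in groups.items()}

# Two map jobs, each processing half of the input (assumed split).
mapped = map_phase(records[:6]) + map_phase(records[6:])
totals = reduce_phase(shuffle(mapped))
print(totals)   # sums per ZipCode, e.g. 6654FD -> 75 + 22 = 97
```

In real Hadoop the mappers run on the DataNodes that hold each input split, and the shuffle moves data across the network so that all values for one key reach the same reducer.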
29. The Hadoop Ecosystem
• Data Access:
– Hive
– Pig
– Mahout
• Tools
– Sqoop
– Flume
30. Hive
• Declarative language
• Allows users to write SQL-like queries (not ANSI SQL)
• Analytics area
• Structures data
• Data in tables
• Tables persist
[Diagram: Hive on top of MapReduce / HDFS]
31. PIG
• Procedural language
(PigLatin)
• Generates one or more
MapReduce jobs
• Efficiency in computing
• Structured/unstructured
data
• Data in Variables
• May not retain values
[Diagram: Hive and PIG on top of MapReduce / HDFS]
32. Mahout
• Library for scalable
machine learning
(written in Java)
• Classification,
Clustering, Pattern
Mining, etc.
[Diagram: Mahout added to the stack on top of MapReduce / HDFS]
33. Sqoop
• To transfer data to and
from a relational
database
• Compression of data is
a feature
[Diagram: Sqoop added to the stack on top of MapReduce / HDFS]
34. Flume
• An application for moving streaming data into a Hadoop cluster
• A massively distributable framework for event-based data
[Diagram: Flume added to the stack on top of MapReduce / HDFS]