1. Big Data
A brief introduction to Big Data
&
Hadoop
01/01/16 F. v. Noort
2. Big Data – A definition
• Big data usually includes data sets (both
structured and unstructured) with sizes beyond
the ability of commonly used software tools to
capture, curate, manage, and process data within
a tolerable elapsed time.
• Doug Laney (2001) 3V’s: “data growth challenges
and opportunities defined as being three-
dimensional, i.e. increasing Volume (amount of
data), Velocity (speed of data in and out), and
Variety (range of data types and sources)”
Big Data – A brief introduction, 01/01/16, F. v. Noort
3. Big Data – A definition
• Gartner (2012): "Big data is high volume, high
velocity, and/or high variety information
assets that require new forms of processing to
enable enhanced decision making, insight
discovery and process optimization."
4. Big Data - Characterization
The original 3V’s have been expanded by the following more
complete set of characteristics:
• Volume: the quantity of generated & stored data
• Velocity: the speed at which data is generated & processed
• Variety: the type and nature of the data
• Variability: Inconsistency of the data set can hamper
processes to handle and manage it
• Veracity: The quality of captured data can vary greatly,
affecting accurate analysis.
• Complexity: Managing data coming from multiple sources
can be very challenging. Data must be linked, connected,
and correlated so users can query and process it effectively.
5. Difference: Big Data versus BI
• Business Intelligence uses descriptive statistics
with high information density data to measure
things, detect trends, etc.
• Big data (analytics) uses inductive statistics and
concepts from nonlinear system identification to
infer laws (regressions, nonlinear relationships,
and causal effects) from large sets of data with
low information density to reveal relationships
and dependencies, or to perform predictions of
outcomes and behaviors
6. Architecture: Client Server
[Diagram: one central Server with many Clients connected to it]
Clients can always overwhelm the system!
7. Architecture: Storage Area Network
[Diagram: many Clients connect through a Central Contact Point to multiple Servers]
Clients can always overwhelm the system!
8. Architecture: Google File System (GFS)
• Instead of having a giant file storage appliance sitting in the back end, use industry-standard hardware on a large scale
• Drive high performance through the sheer number of components
• Reliability through redundancy & replication
• Computation work is done where the data is
[Diagram: a large grid of identical commodity nodes, each combining Storage and Compute]
9. Hadoop
• Based on work from Google File System + MapReduce
• Doug Cutting & Mike Cafarella created their own
version: Hadoop (named after Doug's son's toy elephant)
• Current distributions based on Open Source and
Vendor Work
– Apache Hadoop
– Cloudera: CDH4 w/ Impala
– Hortonworks
– MapR
– AWS
– Windows Azure HDInsight
10. Why use Hadoop?
• Scalability: Scales to Petabytes or more
• Fault tolerant
• Faster: Parallel data processing
• Better: suited for particular types of Big Data problems
• Open source
• Low cost: can be deployed on commodity
hardware
11. Hadoop Core Architecture
Hadoop core comprises:
• A distributed file system
HDFS: Hadoop Distributed File System (based on GFS)
File sharing & data protection across physical servers
• A processing paradigm
MapReduce
Distributed computing across physical servers
[Diagram: MapReduce on top of HDFS]
12. HDFS (1/2)
Hadoop Distributed File System
• Written in Java
• On top of the native file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file; large files are split into blocks
• Built on x86 standards
Lots of flexibility: reference architectures for many types of servers
[Diagram: the Hadoop Distributed File System spanning multiple x86 servers]
13. HDFS (2/2)
• HDFS File Blocks
– 64 MB (default), 128 MB (recommended)
– One HDFS block is backed by multiple operating system (OS) blocks
• Blocks are replicated (default 3x) to multiple nodes
• Allows for node failure without data loss
• Two key services
– Master NameNode
– Many DataNodes
• Checkpoint Node (Secondary NameNode)
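The block splitting and replication described above can be illustrated with a minimal sketch (plain Python, not Hadoop's actual implementation; the round-robin placement and DataNode names are illustrative, and real HDFS placement is rack-aware and decided by the NameNode):

```python
# Illustrative sketch: split a file into HDFS-style blocks and assign
# each block to 3 DataNodes (default replication factor 3).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the recommended block size
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Simplified round-robin placement of each block's replicas."""
    placement = {}
    for index, _ in blocks:
        placement[index] = [datanodes[(index + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)        # a 300 MB file
nodes = [f"DataNode{n}" for n in range(1, 7)]
print(len(blocks))                                   # 3 blocks: 128 MB + 128 MB + 44 MB
print(place_replicas(blocks, nodes)[0])
```

Losing any single node still leaves two replicas of each of its blocks, which is what makes node failure survivable.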
14. MapReduce
“take your task which is data oriented, chunk it up and distribute it on
the network such that every piece of work is done within the network
by the machine that has the piece of data that needs to be worked on”
MapReduce
• Processing paradigm that pairs with HDFS
• Distributed computation algorithm that pushes the compute down to each of the x86 servers
• Fault tolerant
• Parallelized (scalable) processing
• Combination of a Map- and a Reduce procedure:
– Map procedure: performs filtering and sorting of the data
– Reduce procedure: performs summary operations
15. Other Hadoop tools/frameworks
• Data Access:
– Hive, Pig, Mahout
• Tools
– Sqoop, Flume
16. Hadoop Architecture
Main nodes of Hadoop
• Hadoop Distributed File System (HDFS) nodes
– NameNode
– DataNode
• MapReduce nodes
– JobTracker
– TaskTracker
17. HDFS - NameNode
• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file-to-block-to-location mappings in the
namespace (manages the file system
namespace and metadata)
• Don’t use inexpensive commodity hardware
for this node
• Large memory requirements, keeps the entire
file system metadata in memory
18. HDFS - DataNode
• Many per Hadoop cluster
• Stores blocks on local disk
• Manages blocks with data and serves them to
clients
• Checksums on blocks => fault tolerant data store
system
• Clients connect to DataNode for I/O
• Sends frequent heartbeats ("hey, I'm alive"
pings, by default every 3 seconds) to the NameNode
• Sends block reports to NameNode
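The checksum idea can be sketched as follows (plain Python with a CRC from the standard library; real HDFS keeps per-chunk checksums in separate metadata files, which this sketch glosses over):

```python
import zlib

# Illustrative sketch: a DataNode stores a checksum next to each block
# and verifies it on read, so corrupted replicas can be detected and
# the client redirected to a healthy copy.
def store_block(data: bytes):
    """Store a block together with its CRC32 checksum."""
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(stored):
    """Verify the checksum before serving the block to a client."""
    if zlib.crc32(stored["data"]) != stored["checksum"]:
        raise IOError("block corrupted: fetch a replica from another DataNode")
    return stored["data"]

block = store_block(b"some block contents")
assert read_block(block) == b"some block contents"

block["data"] = b"bit-flipped contents"   # simulate on-disk corruption
try:
    read_block(block)
except IOError:
    print("corruption detected")
```

Combined with replication, a failed checksum is not fatal: the bad replica is dropped and re-replicated from one of the other copies.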
19. HDFS Write operation
[Diagram: an HDFS Client, the NameNode, and three racks of DataNodes (DataNode 1–18)]
• Client divides the file into blocks (Block 1, Block 2, Block 3)
• Client contacts the NameNode to write data
• NameNode replies: write it to these nodes (DN1, DN7, DN15)
• DataNodes replicate the data blocks, orchestrated by the NameNode
• Default: 3 replicas
• Rack-aware system!
20. HDFS Read operation
[Diagram: the HDFS Client, the NameNode, and the three racks of DataNodes holding the replicated blocks]
• Client contacts the NameNode to read data
• NameNode replies: find the blocks on these nodes
21. HDFS 2.0 Features
• NameNode High-Availability
– Two redundant NameNodes in active/passive
configuration
– Manual or automated failover
• NameNode Federation
– Multiple independent NameNodes using the same
collection of DataNodes
22. Hadoop MapReduce
• Moves the code to the data
• JobTracker
– Master service to monitor jobs
• TaskTracker
– Multiple services to run tasks
– Same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts
23. MapReduce JobTracker
• One per Hadoop cluster
• Receives job requests submitted by Client
• Schedules jobs in FIFO order
• Schedules & monitors MapReduce jobs on
task trackers
• Issues task attempts to TaskTrackers
• Single point of failure for MapReduce
24. MapReduce TaskTracker
• Runs on same node as DataNode service
• Many per Hadoop cluster
• Sends heartbeats and task reports to
JobTracker
• Executes MapReduce operations
• Configurable number of map and reduce slots
• Runs map and reduce task attempts
25. HDFS Architecture: Master & Slave
[Diagram: an HDFS Client talks to the master services (JobTracker, NameNode, Secondary NameNode); each slave node runs a DataNode plus a TaskTracker; MapReduce provides distributed data processing across the slaves]
Note
• Hadoop 1.0 has only 1 NameNode
• Hadoop 2.0 has an active & passive NameNode
26. How MapReduce works (1/3)
• MapReduce is a combination of a Map- and a
Reduce procedure:
– Map procedure: performs filtering and sorting of
the data
– Reduce procedure: performs summary operations
27. How MapReduce works (2/3)
Scenario: get the sum of sales grouped by ZipCode, using 2 map jobs (Mapper 1 and Mapper 2).

Input (CustId, ZipCode, Amount):
4 6654FD €75
7 1534CD €60
2 5734CD €30
1 1184AN €15
5 5734CD €65
0 6654FD €22
5 5734CD €15
6 4484AN €10
3 1534CD €95
8 4484AN €55
6 4484AN €25
9 1184AN €15

Map phase output (ZipCode, Amount):
6654FD €75
1534CD €60
5734CD €30
1184AN €15
5734CD €65
6654FD €22
5734CD €15
4484AN €10
1534CD €95
4484AN €55
4484AN €25
1184AN €15
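The map output above still has to be shuffled and reduced. A minimal plain-Python sketch of the whole scenario follows; it models the idea only, not Hadoop's actual API, and the half-and-half split between the two mappers is an illustrative assumption:

```python
from collections import defaultdict

# Illustrative MapReduce sketch for "sum sales grouped by ZipCode".
records = [
    (4, "6654FD", 75), (7, "1534CD", 60), (2, "5734CD", 30),
    (1, "1184AN", 15), (5, "5734CD", 65), (0, "6654FD", 22),
    (5, "5734CD", 15), (6, "4484AN", 10), (3, "1534CD", 95),
    (8, "4484AN", 55), (6, "4484AN", 25), (9, "1184AN", 15),
]

def map_phase(split):
    """Map: emit (ZipCode, Amount) pairs, dropping CustId."""
    return [(zipcode, amount) for _, zipcode, amount in split]

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the amounts for each ZipCode."""
    return {key: sum(values) for key, values in groups.items()}

# Two map jobs, each processing half of the input (assumed split).
mapped = map_phase(records[:6]) + map_phase(records[6:])
totals = reduce_phase(shuffle(mapped))
print(totals)   # sums per ZipCode, e.g. 6654FD -> 75 + 22 = 97
```

In real Hadoop the mappers run on the DataNodes that hold each input split, and the shuffle moves data across the network so that all values for one key reach the same reducer.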
29. The Hadoop Ecosystem
• Data Access:
– Hive
– Pig
– Mahout
• Tools
– Sqoop
– Flume
30. Hive
• Declarative language
• Allows users to write SQL-like queries (not ANSI SQL)
• Analytics area
• Structures data
• Data in tables
• Tables persist
[Diagram: Hive on top of MapReduce / HDFS]
31. PIG
• Procedural language
(PigLatin)
• Generates one or more
MapReduce jobs
• Efficiency in computing
• Structured/unstructured
data
• Data in Variables
• May not retain values
[Diagram: Hive and PIG on top of MapReduce / HDFS]
32. Mahout
• Library for scalable
machine learning
(written in Java)
• Classification,
Clustering, Pattern
Mining, etc.
[Diagram: Mahout added to the stack on top of MapReduce / HDFS]
33. Sqoop
• To transfer data to and
from a relational
database
• Compression of data is
a feature
[Diagram: Sqoop added to the stack on top of MapReduce / HDFS]
34. Flume
• An application for moving streaming data into a Hadoop cluster
• A massively distributable framework for event-based data
[Diagram: Flume added to the stack on top of MapReduce / HDFS]