Big Data
A brief introduction to Big Data & Hadoop
01/01/16 F. v. Noort
Big Data – A definition
• Big data usually includes data sets (both
structured and unstructured) with sizes beyond
the ability of commonly used software tools to
capture, curate, manage, and process data within
a tolerable elapsed time.
• Doug Laney (2001) 3V’s: “data growth challenges
and opportunities defined as being three-
dimensional, i.e. increasing Volume (amount of
data), Velocity (speed of data in and out), and
Variety (range of data types and sources)”
Big Data – A definition
• Gartner (2012): "Big data is high volume, high
velocity, and/or high variety information
assets that require new forms of processing to
enable enhanced decision making, insight
discovery and process optimization."
Big Data - Characterization
The original 3V’s have been expanded into the following, more
complete set of characteristics:
• Volume: the quantity of generated & stored data
• Velocity: the speed at which data is generated & processed
• Variety: the type and nature of the data
• Variability: inconsistency in the data set can hamper the
processes that handle and manage it
• Veracity: the quality of captured data can vary greatly,
affecting the accuracy of the analysis.
• Complexity: Managing data coming from multiple sources
can be very challenging. Data must be linked, connected,
and correlated so users can query and process it effectively.
Difference: Big Data versus BI
• Business Intelligence uses descriptive statistics
with high information density data to measure
things, detect trends, etc.
• Big data (analytics) uses inductive statistics and
concepts from nonlinear system identification to
infer laws (regressions, nonlinear relationships,
and causal effects) from large sets of data with
low information density to reveal relationships
and dependencies, or to perform predictions of
outcomes and behaviors
Architecture: Client Server
[Figure: one server serving many clients]
Clients can always overwhelm the system!
Architecture: Storage Area Network
[Figure: a central contact point, four servers and six clients]
Clients can always overwhelm the system!
Architecture: Google File System (GFS)
• Instead of having a giant
file storage appliance
sitting in the back end,
use industry standard
hardware on a large scale
• Drive high performance
through the sheer
number of components
• Reliability through
redundancy & replication
• Computation is done
where the data
resides
[Figure: grid of sixteen nodes, each combining Storage and Compute]
Hadoop
• Based on work from Google File System + MapReduce
• Doug Cutting & Mike Cafarella created their own
version: Hadoop (named after Doug’s son’s toy elephant)
• Current distributions based on Open Source and
Vendor Work
– Apache Hadoop
– Cloudera – CDH4 with Impala
– Hortonworks
– MapR
– AWS
– Windows Azure HDInsight
Why use Hadoop?
• Scalability: Scales to Petabytes or more
• Fault tolerant
• Faster: Parallel data processing
• Better: suited for particular types of Big Data
problems
• Open source
• Low cost: can be deployed on commodity
hardware
Hadoop Core Architecture
Hadoop core comprises:
• Distributed File System
– HDFS: Hadoop Distributed File System (based on GFS)
– File Sharing & Data Protection Across Physical Servers
• Processing paradigm
– MapReduce
– Distributed Computing Across Physical Servers
[Figure: Hadoop core stack with MapReduce and HDFS]
HDFS (1/2)
Hadoop Distributed File System
• Written in Java
• Runs on top of the native file system
• Designed to handle very large files with
streaming data access patterns
• Splits large files into blocks; a block stores a
file or part of a file
• Built on x86 standards
– Lots of flexibility: reference architectures for
many types of servers
[Figure: Hadoop Distributed File System spanning four x86 servers]
HDFS (2/2)
• HDFS File Blocks
– 64 MB (default), 128 MB (recommended)
– One HDFS block is backed by multiple operating system
(OS) blocks
• Blocks are replicated (default 3x) to multiple nodes
• Allows for node failure without data loss
• Two key services
– Master NameNode
– Many DataNodes
• Checkpoint Node (Secondary NameNode)
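The block and replication figures above translate into simple arithmetic. The following is a minimal sketch, not part of the original slides: the function name `hdfs_footprint` is hypothetical, and it assumes that the last block of a file only occupies as much space as the data it holds.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MIB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Hypothetical helper for illustration; the defaults mirror the slide
    (128 MB recommended block size, 3x replication).
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    # Integer ceiling division: a partial last block still counts as a block.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # Assumption: the last block is not padded, so raw usage is size * replicas.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes


blocks, raw = hdfs_footprint(1 * GIB)
print(f"1 GiB file: {blocks} blocks, {raw // GIB} GiB of raw storage")
```

With the 64 MB block size the same file would need twice as many blocks, which in turn means more metadata for the NameNode to track.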
MapReduce
“take your task which is data oriented, chunk it up and distribute it on
the network such that every piece of work is done within the network
by the machine that has the piece of data that needs to be worked on”
MapReduce
• Processing paradigm that pairs with HDFS
• Distributed computation algorithm that pushes the compute down
to each of the x86 servers
• Fault tolerant
• Parallelized (scalable) processing
• Combination of a Map and a Reduce procedure:
– Map procedure: performs filtering and sorting of the data
– Reduce procedure: performs summary operations
Other Hadoop tools/frameworks
• Data Access:
– Hive, Pig, Mahout
• Tools
– Sqoop, Flume
Hadoop Architecture
Main nodes of Hadoop
• Hadoop Distributed File System (HDFS) nodes
– NameNode
– DataNode
• MapReduce nodes
– JobTracker
– TaskTracker
HDFS - NameNode
• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores the file-to-block and block-to-location
mappings (manages the file system
namespace and metadata)
• Don’t use inexpensive commodity hardware
for this node
• Large memory requirements, keeps the entire
file system metadata in memory
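To make the "file to block to location" mapping concrete, here is a toy in-memory model. It is only a sketch under stated assumptions: the class `ToyNameNode`, its method names, the path and the block IDs are hypothetical and do not reflect the real NameNode implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical, simplified model of the NameNode's in-memory metadata."""

    # file path -> ordered list of block IDs
    file_to_blocks: dict = field(default_factory=dict)
    # block ID -> set of DataNode names that hold a replica
    block_to_nodes: dict = field(default_factory=dict)

    def add_file(self, path, placements):
        """Register a file; placements is [(block_id, [datanode, ...]), ...]."""
        self.file_to_blocks[path] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_to_nodes[block_id] = set(nodes)

    def locate(self, path):
        """Answer a client read request: which nodes hold each block, in file order."""
        return [
            (block_id, sorted(self.block_to_nodes[block_id]))
            for block_id in self.file_to_blocks[path]
        ]


name_node = ToyNameNode()
name_node.add_file(
    "/sales/file1",  # hypothetical path
    [("blk_1", ["DN1", "DN7", "DN15"]), ("blk_2", ["DN2", "DN8", "DN16"])],
)
for block_id, nodes in name_node.locate("/sales/file1"):
    print(block_id, nodes)
```

Because both dictionaries live in memory, their size grows with the number of files and blocks, which is why the node needs a lot of RAM.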
HDFS - DataNode
• Many per Hadoop cluster
• Stores blocks on local disk
• Manages blocks with data and serves them to
clients
• Checksums on blocks => fault-tolerant data
storage
• Clients connect to DataNode for I/O
• Sends frequent heartbeats (“hey, I’m alive” pings,
every 3 seconds by default) to the NameNode
• Sends block reports to NameNode
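The heartbeat mechanism can be sketched as a table of "last seen" timestamps on the master side. This is an illustrative sketch only: the class `HeartbeatMonitor` and the 30-second timeout are assumptions made for the example, not HDFS settings, and time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical master-side bookkeeping of DataNode heartbeats."""

    def __init__(self, timeout_seconds=30.0):
        # Assumed timeout for illustration; not an actual HDFS default.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of last heartbeat

    def heartbeat(self, datanode, now):
        """Record that `datanode` reported in at time `now` (seconds)."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("DN1", now=0.0)
monitor.heartbeat("DN2", now=0.0)
monitor.heartbeat("DN1", now=25.0)  # DN1 keeps reporting, DN2 goes silent
print(monitor.dead_nodes(now=40.0))
```

Once a node is considered dead, the master can schedule new replicas of the blocks that node held, using the block reports of the surviving nodes.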
HDFS Write operation
[Figure: an HDFS client splits File 1 into Block 1, Block 2 and Block 3; the cluster consists of a NameNode and three racks (Rack 1: DataNodes 1–6, Rack 2: DataNodes 7–12, Rack 3: DataNodes 13–18); each block is stored three times]
• Client divides the file into blocks
• Client contacts the NameNode to write data
• NameNode says: write it to these nodes (DN1, DN7, DN15)
• DataNodes replicate data blocks,
orchestrated by the NameNode
• Default: 3 replicas
• Rack-aware system!
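Rack awareness can be illustrated with a deliberately simplified placement rule: put each replica of a block on a different rack, as in the DN1 / DN7 / DN15 example of the slide. The sketch below is an assumption-laden illustration; `place_replicas` is a hypothetical function, and the actual HDFS placement policy differs in its details.

```python
def place_replicas(block_index, racks, replication=3):
    """Pick one DataNode from each of `replication` distinct racks.

    Simplified round-robin rule for illustration only; this is not the
    placement policy that HDFS itself implements.
    """
    rack_names = sorted(racks)
    if replication > len(rack_names):
        raise ValueError("this toy rule needs at least one rack per replica")
    chosen = []
    for offset in range(replication):
        rack = rack_names[(block_index + offset) % len(rack_names)]
        nodes = racks[rack]
        # Rotate through the nodes of a rack so blocks spread out evenly.
        chosen.append(nodes[block_index % len(nodes)])
    return chosen


# Rack layout as drawn on the slide: three racks with six DataNodes each.
RACKS = {
    "rack1": [f"DN{i}" for i in range(1, 7)],
    "rack2": [f"DN{i}" for i in range(7, 13)],
    "rack3": [f"DN{i}" for i in range(13, 19)],
}

for block in range(3):
    print(f"Block {block + 1}:", place_replicas(block, RACKS))
```

Spreading replicas over racks means that losing a whole rack, for example through a switch failure, still leaves at least one copy of every block reachable.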
HDFS Read operation
[Figure: HDFS client, NameNode and three racks (Rack 1: DataNodes 1–6, Rack 2: DataNodes 7–12, Rack 3: DataNodes 13–18) holding three replicas each of Block 1, Block 2 and Block 3]
• Client contacts the NameNode to read data
• NameNode says: find it on these nodes
HDFS 2.0 Features
• NameNode High-Availability
– Two redundant NameNodes in active/passive
configuration
– Manual or automated failover
• NameNode Federation
– Multiple independent NameNodes using the same
collection of DataNodes
Hadoop MapReduce
• Moves the code to the data
• JobTracker
– Master service to monitor jobs
• TaskTracker
– Multiple services to run tasks
– Same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts
MapReduce JobTracker
• One per Hadoop cluster
• Receives job requests submitted by Client
• Schedules jobs in FIFO order
• Schedules & monitors MapReduce jobs on
TaskTrackers
• Issues task attempts to TaskTrackers
• Single point of failure for MapReduce
MapReduce TaskTracker
• Runs on same node as DataNode service
• Many per Hadoop cluster
• Sends heartbeats and task reports to
JobTracker
• Executes MapReduce operations
• Configurable number of map and reduce slots
• Runs map and reduce task attempts
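The interplay between FIFO job scheduling on the JobTracker and the limited number of slots on each TaskTracker can be sketched as follows. This is a simplified illustration: `schedule_fifo` is a hypothetical function, it models a single scheduling round, and it ignores data locality, which a real scheduler would take into account.

```python
from collections import deque


def schedule_fifo(jobs, free_slots):
    """Assign tasks to free slots in job submission order.

    jobs:       [(job_name, number_of_tasks), ...] in submission order
    free_slots: {tasktracker_name: number_of_free_slots}
    Returns a list of (job_name, task_index, tasktracker_name).
    Hypothetical sketch: one scheduling round, no data locality.
    """
    pending = deque(
        (job_name, task_index)
        for job_name, task_count in jobs
        for task_index in range(task_count)
    )
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignments = []
    for tracker in sorted(slots):
        while slots[tracker] > 0 and pending:
            job_name, task_index = pending.popleft()
            assignments.append((job_name, task_index, tracker))
            slots[tracker] -= 1
    return assignments


# Two jobs, two TaskTrackers with two slots each: job B has to wait for job A.
for assignment in schedule_fifo([("A", 3), ("B", 2)], {"TT1": 2, "TT2": 2}):
    print(assignment)
```

In this toy run the first job takes three of the four slots, so only one task of the second job starts; the remaining task waits until a slot is freed.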
HDFS Architecture: Master & Slave
[Figure: an HDFS client and the master services (NameNode, Secondary NameNode, JobTracker) above three slave nodes, each running a DataNode and a TaskTracker; caption: MapReduce Distributed Data Processing]
Note
• Hadoop 1.0 has only one NameNode
• Hadoop 2.0 has an active & a passive NameNode
How MapReduce works (1/3)
• MapReduce is a combination of a Map and a
Reduce procedure:
– Map procedure: performs filtering and sorting of
the data
– Reduce procedure: performs summary operations
How MapReduce works (2/3)
Scenario: get the sum of sales grouped by ZipCode

Input records:
CustId  ZipCode  Amount
4       6654FD   €75
7       1534CD   €60
2       5734CD   €30
1       1184AN   €15
5       5734CD   €65
0       6654FD   €22
5       5734CD   €15
6       4484AN   €10
3       1534CD   €95
8       4484AN   €55
6       4484AN   €25
9       1184AN   €15

Map phase (2 map jobs: Mapper 1 and Mapper 2), output per record:
ZipCode  Amount
6654FD   €75
1534CD   €60
5734CD   €30
1184AN   €15
5734CD   €65
6654FD   €22
5734CD   €15
4484AN   €10
1534CD   €95
4484AN   €55
4484AN   €25
1184AN   €15
How MapReduce works (3/3)
Scenario: get the sum of sales grouped by ZipCode

Map output:
ZipCode  Amount
6654FD   €75
1534CD   €60
5734CD   €30
1184AN   €15
5734CD   €65
6654FD   €22
5734CD   €15
4484AN   €10
1534CD   €95
4484AN   €55
4484AN   €25
1184AN   €15

Shuffle phase and sort (pairs grouped by ZipCode and handed to three reducers):
ZipCode  Amounts
5734CD   €65, €30, €15
4484AN   €10, €25, €55
1534CD   €60, €95
6654FD   €75, €22
1184AN   €15, €15

Sum (reducer output):
ZipCode  Total
5734CD   €110
1534CD   €155
4484AN   €90
1184AN   €30
6654FD   €97
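The scenario above can be reproduced in a few lines of plain Python. This is a single-process sketch of the three phases, not Hadoop code: the function names are hypothetical, and the way the twelve records are split over the two mappers is an assumption, since the slide does not show it.

```python
from collections import defaultdict

# (CustId, ZipCode, Amount in EUR), copied from the slide.
RECORDS = [
    (4, "6654FD", 75), (7, "1534CD", 60), (2, "5734CD", 30), (1, "1184AN", 15),
    (5, "5734CD", 65), (0, "6654FD", 22), (5, "5734CD", 15), (6, "4484AN", 10),
    (3, "1534CD", 95), (8, "4484AN", 55), (6, "4484AN", 25), (9, "1184AN", 15),
]


def map_phase(split):
    """Map: drop CustId and emit one (ZipCode, Amount) pair per record."""
    return [(zip_code, amount) for _cust_id, zip_code, amount in split]


def shuffle_phase(mapper_outputs):
    """Shuffle and sort: collect all amounts per ZipCode, ordered by key."""
    groups = defaultdict(list)
    for output in mapper_outputs:
        for zip_code, amount in output:
            groups[zip_code].append(amount)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: sum the amounts of each ZipCode."""
    return {zip_code: sum(amounts) for zip_code, amounts in groups.items()}


# Assumed split: the first six records go to Mapper 1, the rest to Mapper 2.
splits = [RECORDS[:6], RECORDS[6:]]
totals = reduce_phase(shuffle_phase([map_phase(split) for split in splits]))
print(totals)
# {'1184AN': 30, '1534CD': 155, '4484AN': 90, '5734CD': 110, '6654FD': 97}
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves the pairs over the network so that all values for one key end up at the same reducer.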
The Hadoop Ecosystem
• Data Access:
– Hive
– Pig
– Mahout
• Tools
– Sqoop
– Flume
Hive
• Declarative language
• Allows users to
write SQL-like queries
(not ANSI SQL)
• Analytics area
• Structures data
• Data in tables
• Tables persist
[Figure: Hadoop stack with MapReduce, HDFS and Hive]
Pig
• Procedural language
(PigLatin)
• Generates one or more
MapReduce jobs
• Efficiency in computing
• Structured/unstructured
data
• Data in Variables
• May not retain values
[Figure: Hadoop stack with MapReduce, HDFS, Hive and Pig]
Mahout
• Library for scalable
machine learning
(written in Java)
• Classification,
clustering, pattern
mining, etc.
[Figure: Hadoop stack with MapReduce, HDFS, Hive, Pig and Mahout]
Sqoop
• Transfers data to and
from a relational
database
• Supports compression
of the data
[Figure: Hadoop stack with MapReduce, HDFS, Hive, Pig, Mahout and Sqoop]
Flume
• An application for
moving streaming
data into a
Hadoop cluster
• A massively
distributable
framework for
event-based data
[Figure: Hadoop stack with MapReduce, HDFS, Hive, Pig, Mahout, Sqoop and Flume]

More Related Content

What's hot

BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asiaMuhammad Rifqi
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An OverviewArvind Kalyan
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data HadoopApache Apex
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizITJobZone.biz
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
An introduction to Big Data
An introduction to Big DataAn introduction to Big Data
An introduction to Big DataForwardSprint
 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation17aroumougamh
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata Mk Kim
 

What's hot (20)

BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
An introduction to Big Data
An introduction to Big DataAn introduction to Big Data
An introduction to Big Data
 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata
 

Viewers also liked

Strategyzing big data in telco industry
Strategyzing big data in telco industryStrategyzing big data in telco industry
Strategyzing big data in telco industryParviz Iskhakov
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 
Big data deep learning: applications and challenges
Big data deep learning: applications and challengesBig data deep learning: applications and challenges
Big data deep learning: applications and challengesfazail amin
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"Nicola Ferraro
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataKristof Jozsa
 
Virtualization, the cloud enabler
Virtualization, the cloud enablerVirtualization, the cloud enabler
Virtualization, the cloud enablerPraveen Hanchinal
 
Big data and its applications
Big data and its applicationsBig data and its applications
Big data and its applicationsali easazadeh
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesIsheeta Sanghi
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big DataBernard Marr
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBernard Marr
 
Principles of microservices velocity
Principles of microservices   velocityPrinciples of microservices   velocity
Principles of microservices velocitySam Newman
 
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingBig Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingHealth Catalyst
 

Viewers also liked (20)

Strategyzing big data in telco industry
Strategyzing big data in telco industryStrategyzing big data in telco industry
Strategyzing big data in telco industry
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data deep learning: applications and challenges
Big data deep learning: applications and challengesBig data deep learning: applications and challenges
Big data deep learning: applications and challenges
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Virtualization, the cloud enabler
Virtualization, the cloud enablerVirtualization, the cloud enabler
Virtualization, the cloud enabler
 
Big data and its applications
Big data and its applicationsBig data and its applications
Big data and its applications
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
 
Big Data simplified
Big Data simplifiedBig Data simplified
Big Data simplified
 
Big Data Application Architectures - IoT
Big Data Application Architectures - IoTBig Data Application Architectures - IoT
Big Data Application Architectures - IoT
 
Big Data Tutorial V4
Big Data Tutorial V4Big Data Tutorial V4
Big Data Tutorial V4
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
Big Idea For Big Data
Big Idea For Big DataBig Idea For Big Data
Big Idea For Big Data
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must Know
 
Principles of microservices velocity
Principles of microservices   velocityPrinciples of microservices   velocity
Principles of microservices velocity
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingBig Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 

Similar to Big Data Intro - Hadoop & Definitions

Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopMedia Gorod
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentalsits_skm
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Kevin Crocker
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopLeons Petražickis
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 

Similar to Big Data Intro - Hadoop & Definitions (20)

2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Hadoop fundamentals
Hadoop fundamentalsHadoop fundamentals
Hadoop fundamentals
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentals
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Hadoop
HadoopHadoop
Hadoop
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 

Recently uploaded

Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 

Big Data Intro - Hadoop & Definitions

  • 1. Big Data A brief introduction into Big Data & Hadoop 01/01/16 F. v. Noort
  • 2. Big Data – A definition • Big data usually includes data sets (both structured and unstructured) with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. • Doug Laney (2001) 3V’s: “data growth challenges and opportunities defined as being three-dimensional, i.e. increasing Volume (amount of data), Velocity (speed of data in and out), and Variety (range of data types and sources)”
  • 3. Big Data – A definition • Gartner (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
  • 4. Big Data - Characterization The original 3V’s have been expanded into the following more complete set of characteristics: • Volume: the quantity of generated & stored data • Velocity: the speed at which data is generated & processed • Variety: the type and nature of the data • Variability: inconsistency of the data set can hamper the processes that handle and manage it • Veracity: the quality of captured data can vary greatly, affecting accurate analysis • Complexity: managing data coming from multiple sources can be very challenging. Data must be linked, connected, and correlated so users can query and process it effectively.
  • 5. Difference Big Data versus BI • Business Intelligence uses descriptive statistics with high information density data to measure things, detect trends, etc. • Big data (analytics) uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors
  • 6. Architecture: Client Server [Diagram: a single server surrounded by many clients] Clients can always overwhelm the system!
  • 7. Architecture: Storage Area Network [Diagram: a central contact point, four servers, and six clients] Clients can always overwhelm the system!
  • 8. Architecture: Google File System (GFS) • Instead of having a giant file storage appliance sitting in the back end, use industry-standard hardware on a large scale • Drive high performance through the sheer number of components • Reliability through redundancy & replication • Computation is done where the data resides [Diagram: a grid of nodes, each combining storage and compute]
  • 9. Hadoop • Based on work from Google File System + MapReduce • Doug Cutting & Mike Cafarella created their own version: Hadoop (named after Doug’s son’s toy elephant) • Current distributions based on open source and vendor work – Apache Hadoop – Cloudera - CDH4 w/ Impala – Hortonworks – MapR – AWS – Windows Azure HDInsight
  • 10. Why use Hadoop? • Scalability: scales to petabytes or more • Fault tolerant • Faster: parallel data processing • Better: suited for particular types of Big Data problems • Open source • Low cost: can be deployed on commodity hardware
  • 11. Hadoop Core Architecture Hadoop core consists of a • Distributed file system – HDFS: Hadoop Distributed File System (based on GFS) – File sharing & data protection across physical servers • Processing paradigm – MapReduce – Distributed computing across physical servers [Diagram: MapReduce and HDFS as the two core layers]
  • 12. HDFS (1/2) Hadoop Distributed File System • Written in Java • Runs on top of the native file system • Designed to handle very large files with streaming data access patterns • Stores a file, or parts of a file, in blocks; large files are split into blocks • Built on x86 standards – Lots of flexibility: reference architectures for many types of servers [Diagram: Hadoop Distributed File System spanning four x86 servers]
  • 13. HDFS (2/2) • HDFS file blocks – 64 MB (default), 128 MB (recommended) – 1 HDFS block is supported by multiple operating system (OS) blocks • Blocks are replicated (default 3x) to multiple nodes • Allows for node failure without data loss • Two key services – Master NameNode – Many DataNodes • Checkpoint Node (Secondary NameNode)
  • 14. MapReduce “take your task which is data oriented, chunk it up and distribute it on the network such that every piece of work is done within the network by the machine that has the piece of data that needs to be worked on” • Processing paradigm that pairs with HDFS • Distributed computation algorithm that pushes the compute down to each of the x86 servers • Fault tolerant • Parallelized (scalable) processing • Combination of a Map and a Reduce procedure: – Map procedure: performs filtering and sorting of the data – Reduce procedure: performs summary operations
  • 15. Other Hadoop tools/frameworks • Data Access: – Hive, Pig, Mahout • Tools: – Sqoop, Flume
  • 16. Hadoop Architecture Main nodes of Hadoop • Hadoop Distributed File System (HDFS) nodes – NameNode – DataNode • MapReduce nodes – JobTracker – TaskTracker
  • 17. HDFS - NameNode • Single master service for HDFS • Single point of failure (HDFS 1.x) • Stores file-to-block-to-location mappings in the namespace (manages the file system namespace and metadata) • Don’t use inexpensive commodity hardware for this node • Large memory requirements: keeps the entire file system metadata in memory
  • 18. HDFS - DataNode • Many per Hadoop cluster • Stores blocks on local disk • Manages blocks with data and serves them to clients • Checksums on blocks => fault tolerant data store system • Clients connect to DataNode for I/O • Sends frequent heartbeats to the NameNode (“hey, I’m alive” pings, every 3 seconds by default) • Sends block reports to NameNode
  • 19. HDFS Write operation [Diagram: an HDFS client holding File 1 (Block 1, Block 2, Block 3), the NameNode, and three racks (Rack 1: DataNodes 1–6, Rack 2: DataNodes 7–12, Rack 3: DataNodes 13–18); each block appears three times across the DataNodes] • Client divides the file into blocks • Client contacts the NameNode to write data • NameNode says: write it to these nodes (DN1, DN7, DN15) • DataNodes replicate data blocks, orchestrated by the NameNode • Default: 3 replicas • Rack-aware system!
  • 20. HDFS Read operation [Diagram: an HDFS client, the NameNode, and three racks (Rack 1: DataNodes 1–6, Rack 2: DataNodes 7–12, Rack 3: DataNodes 13–18); Block 1, Block 2, and Block 3 each appear three times across the DataNodes] • Client contacts the NameNode to read data • NameNode says: find it on these nodes
  • 21. HDFS 2.0 Features • NameNode High-Availability – Two redundant NameNodes in active/passive configuration – Manual or automated failover • NameNode Federation – Multiple independent NameNodes using the same collection of DataNodes
  • 22. Hadoop MapReduce • Moves the code to the data • JobTracker – Master service to monitor jobs • TaskTracker – Multiple services to run tasks – Same physical machine as a DataNode • A job contains many tasks • A task contains one or more task attempts
  • 23. MapReduce JobTracker • One per Hadoop cluster • Receives job requests submitted by Client • Schedules jobs in FIFO order • Schedules & monitors MapReduce jobs on task trackers • Issues task attempts to TaskTrackers • Single point of failure for MapReduce
  • 24. MapReduce TaskTracker • Runs on same node as DataNode service • Many per Hadoop cluster • Sends heartbeats and task reports to JobTracker • Executes MapReduce operations • Configurable number of map and reduce slots • Runs map and reduce task attempts
  • 25. HDFS Architecture: Master & Slave [Diagram: an HDFS client; the master services NameNode, Secondary NameNode, and JobTracker; three worker nodes, each running a DataNode and a TaskTracker; labelled “MapReduce Distributed Data Processing”] Note • Hadoop 1.0 has only 1 NameNode • Hadoop 2.0 has an active & a passive NameNode
  • 26. How MapReduce works (1/3) • MapReduce is a combination of a Map and a Reduce procedure: – Map procedure: performs filtering and sorting of the data – Reduce procedure: performs summary operations
  • 27. How MapReduce works (2/3) Scenario: Get the sum of sales grouped by ZipCode • Input records (CustId, ZipCode, Amount): 4 6654FD €75; 7 1534CD €60; 2 5734CD €30; 1 1184AN €15; 5 5734CD €65; 0 6654FD €22; 5 5734CD €15; 6 4484AN €10; 3 1534CD €95; 8 4484AN €55; 6 4484AN €25; 9 1184AN €15 • Map phase: 2 map jobs (Mapper 1, Mapper 2) emit (ZipCode, Amount) pairs: 6654FD €75; 1534CD €60; 5734CD €30; 1184AN €15; 5734CD €65; 6654FD €22; 5734CD €15; 4484AN €10; 1534CD €95; 4484AN €55; 4484AN €25; 1184AN €15
  • 28. How MapReduce works (3/3) Scenario: Get the sum of sales grouped by ZipCode • Mapper output: 6654FD €75; 1534CD €60; 5734CD €30; 1184AN €15; 5734CD €65; 6654FD €22; 5734CD €15; 4484AN €10; 1534CD €95; 4484AN €55; 4484AN €25; 1184AN €15 • Shuffle phase and sort, feeding three reducers: 5734CD €65; 5734CD €30; 5734CD €15; 4484AN €10; 4484AN €25; 1534CD €60; 1534CD €95; 4484AN €55; 6654FD €75; 6654FD €22; 1184AN €15; 1184AN €15 • Sum: 5734CD €110; 1534CD €155; 4484AN €90; 1184AN €30; 6654FD €97
  • 29. The Hadoop Ecosystem • Data Access: – Hive – Pig – Mahout • Tools: – Sqoop – Flume
  • 30. Hive • Declarative language • Allows users to write SQL-like queries (not ANSI SQL) • Analytics area • Structures data • Data in tables • Tables are retained [Diagram: Hadoop stack showing MapReduce, HDFS, Hive]
  • 31. Pig • Procedural language (Pig Latin) • Generates one or more MapReduce jobs • Efficiency in computing • Structured/unstructured data • Data in variables • Variables may not retain their values [Diagram: Hadoop stack showing MapReduce, HDFS, Hive, Pig]
  • 32. Mahout • Library for scalable machine learning (written in Java) • Classification, clustering, pattern mining, etc. [Diagram: Hadoop stack showing MapReduce, HDFS, Hive, Pig, Mahout]
  • 33. Sqoop • Transfers data to and from a relational database • Supports compression of data [Diagram: Hadoop stack showing MapReduce, HDFS, Hive, Pig, Mahout, Sqoop]
  • 34. Flume • An application for moving streaming data into a Hadoop cluster • A massively distributable framework for event-based data [Diagram: Hadoop stack showing MapReduce, HDFS, Hive, Pig, Mahout, Sqoop, Flume]
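
The sales-by-ZipCode scenario from slides 27 and 28 can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps, not Hadoop code: the function names (`map_sales`, `shuffle`, `reduce_sales`, `run_job`) and the way the input is split over two mappers are assumptions made for this example. The records are the ones shown on slide 27.

```python
from collections import defaultdict

# (CustId, ZipCode, Amount in EUR) records copied from slide 27.
SALES = [
    (4, "6654FD", 75), (7, "1534CD", 60), (2, "5734CD", 30),
    (1, "1184AN", 15), (5, "5734CD", 65), (0, "6654FD", 22),
    (5, "5734CD", 15), (6, "4484AN", 10), (3, "1534CD", 95),
    (8, "4484AN", 55), (6, "4484AN", 25), (9, "1184AN", 15),
]


def map_sales(records):
    """Map step: drop the CustId and emit (ZipCode, Amount) pairs."""
    return [(zip_code, amount) for _cust_id, zip_code, amount in records]


def shuffle(mapped_outputs):
    """Shuffle step: collect the amounts of all mappers per ZipCode."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for zip_code, amount in pairs:
            groups[zip_code].append(amount)
    return groups


def reduce_sales(groups):
    """Reduce step: sum the amounts collected for each ZipCode."""
    return {zip_code: sum(amounts) for zip_code, amounts in groups.items()}


def run_job(records, num_mappers=2):
    # Hypothetical input split: one equally sized chunk per mapper.
    # (Hadoop itself derives its input splits from HDFS blocks.)
    chunk = max(1, -(-len(records) // num_mappers))  # ceiling division
    splits = [records[i:i + chunk] for i in range(0, len(records), chunk)]
    mapped = [map_sales(split) for split in splits]
    return reduce_sales(shuffle(mapped))


totals = run_job(SALES)
for zip_code in sorted(totals):
    print(zip_code, totals[zip_code])
# 1184AN 30
# 1534CD 155
# 4484AN 90
# 5734CD 110
# 6654FD 97
```

The totals match the sums on slide 28. The number of mappers does not change the result, because the shuffle step brings all pairs with the same ZipCode together before they are summed.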