2. WHAT IS BIG DATA?
THE TERM BIG DATA REFERS TO A COLLECTION OF DATA SETS SO LARGE AND
COMPLEX THAT THEY ARE DIFFICULT TO CAPTURE, STORE, SEARCH AND ANALYZE
USING TRADITIONAL DATA PROCESSING APPLICATIONS.
BIG DATA = STRUCTURED DATA + UNSTRUCTURED DATA
3. CHARACTERISTICS OF BIG DATA
THE 3 VS (VOLUME, VARIETY AND VELOCITY) ARE THE DEFINING PROPERTIES, OR
DIMENSIONS, OF BIG DATA.
VOLUME REFERS TO THE AMOUNT OF DATA.
VARIETY REFERS TO THE NUMBER OF TYPES OF DATA.
VELOCITY REFERS TO THE SPEED AT WHICH DATA IS GENERATED AND PROCESSED.
6. SO HADOOP IS...
• A PRODUCT OF THE APACHE SOFTWARE FOUNDATION.
• A SOFTWARE FRAMEWORK WRITTEN IN JAVA.
• CROSS-PLATFORM.
• OPEN SOURCE.
THE HADOOP FRAMEWORK IS BUILT FROM:
1. HADOOP COMMON
2. HDFS
3. HADOOP YARN
4. MAPREDUCE
7. HDFS
HDFS IS A SPECIALLY DESIGNED FILE SYSTEM FOR STORING HUGE DATA SETS ON A
CLUSTER OF COMMODITY HARDWARE WITH A STREAMING ACCESS PATTERN.
• CLUSTER
• COMMODITY HARDWARE
• STREAMING ACCESS PATTERN
• SPECIALLY DESIGNED FILE SYSTEM
8. 5 SERVICES OF A HADOOP CLUSTER
• NAME NODE (HDFS)
• SECONDARY NAME NODE (HDFS)
• JOB TRACKER (MAPREDUCE)
• DATA NODE (HDFS)
• TASK TRACKER (MAPREDUCE)
12. REQUIREMENTS FOR INSTALLATION
o JAVA 1.6.X, PREFERABLY FROM SUN, MUST BE INSTALLED.
o SSH MUST BE INSTALLED AND SSHD MUST BE RUNNING TO USE THE HADOOP SCRIPTS THAT
MANAGE REMOTE HADOOP DAEMONS.
o INSTALL HADOOP-2.3.0 AND HADOOP-2.3-CONFIG-MASTER.
o WWW.HADOOP.APACHE.ORG
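
The prerequisites above can be checked before installing Hadoop. The following is a minimal sketch in plain Python, not part of Hadoop itself. It assumes the executables are named java, ssh and sshd and that they are reachable through the PATH; the helper name missing_commands is hypothetical. It only checks that the programs exist. It does not verify the Java version or that sshd is actually running.

```python
import shutil

# Executable names implied by the requirements list (an assumption:
# on some systems sshd lives in /usr/sbin and is not on a user's PATH).
REQUIRED_COMMANDS = ("java", "ssh", "sshd")


def missing_commands(commands=REQUIRED_COMMANDS, which=shutil.which):
    """Return the commands from `commands` that cannot be found on the PATH."""
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("java, ssh and sshd were all found on the PATH")
```

The lookup function is passed in as a parameter so the check can be exercised without depending on what is installed on the machine running it.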
21. FLOW CHART OF WORD COUNT JOB
INPUT FILE: FILE.TXT (200 MB)
INPUT SPLITS: 64 MB, 64 MB, 64 MB, 8 MB
EACH INPUT SPLIT -> RECORD READER -> MAPPER
RECORD READER OUTPUT: (BYTE OFFSET, ENTIRE LINE)
(0 , hi how are you?)
(17 , how is your job?)
MAPPER OUTPUT: (WORD, 1)
(how,1)(what,1)
(is,1)(your,1)
(how,1)(is,1)
(brother,1)(now,1)
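
The flow above can be simulated in a few lines of plain Python. This is a sketch for illustration only and does not use the Hadoop API. The function names (split_sizes, record_reader, mapper, reducer) are hypothetical, the sample text is assumed to use two-byte "\r\n" line endings (which is what makes the second byte offset 17), and the final summing step is the usual word count reduce phase, which the flow chart above does not show.

```python
from collections import Counter

MB = 1024 * 1024


def split_sizes(file_size, split_size):
    """Sizes of the input splits: full-size splits plus one remainder, if any."""
    full, rest = divmod(file_size, split_size)
    return [split_size] * full + ([rest] if rest else [])


def record_reader(data):
    """Yield (byte_offset, entire_line) pairs for a block of text bytes."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the line terminator counts toward the offset


def mapper(offset, line):
    """Emit (word, 1) for every word in the line; the offset key is ignored."""
    for token in line.split():
        word = token.strip("?.,!").lower()
        if word:
            yield word, 1


def reducer(pairs):
    """Sum the 1s emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# A 200 MB file with 64 MB splits gives three full splits and one 8 MB split.
splits_in_mb = [size // MB for size in split_sizes(200 * MB, 64 * MB)]

sample = b"hi how are you?\r\nhow is your job?\r\n"
records = list(record_reader(sample))
pairs = [kv for offset, line in records for kv in mapper(offset, line)]
counts = reducer(pairs)
```

In a real job each split would be handled by its own mapper on a different node. Here a single list of pairs stands in for the combined mapper output.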