Hadoop HDFS
Topics of the Day
• What is Big Data?
• Limitations of the existing solutions
• Solving the problem with Hadoop
• Introduction to Hadoop
• Hadoop Eco-System
• Hadoop Core Components
• HDFS Architecture
• Anatomy of a File Write and Read
What is Big Data?
• Lots of data (terabytes or petabytes)
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, mobile, processing, Big Data]
What is Big Data?
• Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.
NYSE generates about one terabyte of new trade data per day to perform stock-trading analytics to determine trends for optimal trades.

Unstructured Data is Exploding
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
• Three characteristics: Volume, Velocity, Variety
• Examples: web logs, videos, images, audio, sensor data
Common Big Data Customer Scenarios
• Web and e-tailing
– Recommendation Engines
– Ad Targeting
– Search Quality
– Abuse and Click Fraud Detection
• Telecommunications
– Customer Churn Prevention
– Network Performance Optimization
– Call Detail Record (CDR) Analysis
– Analyzing Networks to Predict Failure
Common Big Data Customer Scenarios (Contd.)
• Government
– Fraud Detection and Cyber Security
– Welfare Schemes
– Justice
• Healthcare & Life Sciences
– Health Information Exchange
– Gene Sequencing
– Serialization
– Healthcare Service Quality Improvements
– Drug Safety
Common Big Data Customer Scenarios (Contd.)
• Banks and Financial Services
– Modeling True Risk
– Threat Analysis
– Fraud Detection
– Trade Surveillance
– Credit Scoring and Analysis
• Retail
– Point-of-Sale Transaction Analysis
– Customer Churn Analysis
– Sentiment Analysis
Hidden Treasure
Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
• Insight into data can provide business advantage.
• Some key early indicators can mean fortunes to a business.
• More precise analysis is possible with more data.
Limitations of Existing Data Analytics Architecture

[Diagram: instrumentation and collection (mostly append) feed a storage-only grid holding the original raw data; an ETL compute grid moves aggregated data into an RDBMS, which serves BI reports and interactive apps. About 90% of the ~2 PB is archived, so a meagre 10% of the ~2 PB of data is available for BI.]

The three limitations:
1. Can’t explore the original high-fidelity raw data.
2. Moving data to compute doesn’t scale.
3. Premature data death: data is archived before its value is extracted.

Source: http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Solution: A Combined Storage + Compute Layer

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.

[Diagram: instrumentation and collection (mostly append) feed Hadoop, a combined storage + compute grid; the entire ~2 PB of data is available for processing, with no data archiving. Aggregated data still flows to an RDBMS for BI reports and interactive apps.]

The three gains:
1. Data exploration and advanced analytics.
2. Scalable throughput for ETL and aggregation.
3. Data is kept alive forever.
Hadoop Differentiating Factors
• Accessible
• Robust
• Simple
• Scalable
Hadoop – It’s about Scale and Structure

RDBMS / EDW / MPP / NoSQL vs. Hadoop:
• Data types: structured vs. multi-structured and unstructured
• Processing: limited or no data processing vs. processing coupled with data
• Governance: standards and structured vs. loosely structured
• Schema: required on write vs. required on read
• Cost: software license vs. support only
• Resources: a known entity vs. growing, complex, and wide
• Best-fit use: interactive OLAP analytics and operational data store vs. massive storage/processing
Why DFS?

Read 1 TB of data:
• 1 machine with 4 I/O channels, each channel at 100 MB/s: 45 minutes
• 10 machines, each with 4 I/O channels at 100 MB/s: 4.5 minutes
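The arithmetic behind the slide’s figures, as a quick check (reading 1 TB as $2^{40}$ bytes, which is how the slide’s rounding works out):

$t_{1\,\text{machine}} = \dfrac{2^{40}\ \text{B}}{4 \times 100\ \text{MB/s}} \approx \dfrac{1.1 \times 10^{12}\ \text{B}}{4 \times 10^{8}\ \text{B/s}} \approx 2749\ \text{s} \approx 45\ \text{minutes}$

$t_{10\,\text{machines}} = t_{1\,\text{machine}} / 10 \approx 275\ \text{s} \approx 4.5\ \text{minutes}$

Spreading the file across 10 machines multiplies the aggregate read bandwidth by 10, so the time drops by 10. That parallel-read win is exactly the case a distributed file system is built for.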
What is Hadoop?
• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
• It is an open-source data management platform with scale-out storage and distributed processing.
Hadoop Eco-System
• HDFS (Hadoop Distributed File System): storage
• MapReduce Framework: distributed processing
• HBase: NoSQL database on top of HDFS
• Pig Latin: data analysis
• Hive: data-warehouse (DW) system
• Mahout: machine learning
• Apache Oozie: workflow
• Flume: import/export of unstructured or semi-structured data
• Sqoop: import/export of structured data
Hadoop Core Components

[Diagram: the MapReduce engine (a JobTracker on the admin node driving TaskTrackers) layered on an HDFS cluster (a NameNode on the admin node and DataNodes on the workers); each worker node runs a DataNode paired with a TaskTracker.]
Hadoop Core Components

Hadoop is a system for large-scale data processing. It has two main components:
• HDFS – Hadoop Distributed File System (Storage)
– Distributed across “nodes”
– Natively redundant
– NameNode tracks locations
• MapReduce (Processing)
– Splits a task across processors, “near” the data, and assembles the results
– Self-healing, high bandwidth
– Clustered storage
– JobTracker manages the TaskTrackers
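To make the “simple programming model” concrete, here is the classic WordCount job written against the Hadoop 1.x org.apache.hadoop.mapreduce API; a minimal, self-contained sketch, with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs "near" the data, on each input split; emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: assembles the results by summing the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}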
HDFS Definition

• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
• HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. It has many similarities with existing distributed file systems.
• HDFS is the primary storage system used by Hadoop applications. It creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
• HDFS consists of the following components (daemons):
– Name Node
– Data Node
– Secondary Name Node
HDFS Components

NameNode:
The NameNode, a master server, manages the file system namespace and regulates access to files by clients. It maintains and manages the blocks present on the DataNodes.
• Meta-data in Memory
– The entire metadata is in main memory
• Types of Metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
HDFS Components

Data Node: slaves which are deployed on each machine and provide the actual storage. DataNodes, one per node in the cluster, manage the storage attached to the nodes they run on.
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to clients
– Periodically sends a Block Report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Understanding the File System

Block placement
• Current strategy (see the sketch below):
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from the nearest replica
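Block placement is visible through the public FileSystem API. A minimal sketch (the file path is illustrative; any existing HDFS file works) that prints which hosts hold each block’s replicas:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Illustrative path.
    FileStatus status = fs.getFileStatus(new Path("/user/training/hadoop/sample.txt"));
    // One BlockLocation per block, covering the whole file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i + " replicas on "
          + Arrays.toString(blocks[i].getHosts()));
    }
    fs.close();
  }
}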
Data Correctness
• Use checksums to validate data
– Use CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from the DataNode
– If validation fails, the client can read from another replica
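The idea in code; this is a minimal sketch of per-chunk CRC32 checksumming with the 512-byte chunk size from the slide, not HDFS’s internal implementation:

import java.util.zip.CRC32;

public class ChunkChecksums {

  static final int BYTES_PER_CHECKSUM = 512; // one checksum per 512-byte chunk

  // Compute one CRC32 per 512-byte chunk of the buffer.
  static long[] checksums(byte[] data) {
    int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    long[] crcs = new long[chunks];
    CRC32 crc = new CRC32();
    for (int i = 0; i < chunks; i++) {
      int off = i * BYTES_PER_CHECKSUM;
      int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
      crc.reset();
      crc.update(data, off, len);
      crcs[i] = crc.getValue();
    }
    return crcs;
  }

  public static void main(String[] args) {
    byte[] block = new byte[2048];     // stand-in for block data
    long[] stored = checksums(block);  // computed at write time, stored with the block
    block[100] ^= 1;                   // simulate on-disk corruption
    long[] current = checksums(block); // recomputed at read time
    for (int i = 0; i < stored.length; i++) {
      if (stored[i] != current[i]) {
        System.out.println("chunk " + i + " failed validation; read another replica");
      }
    }
  }
}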
Understanding the File System

Data pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block in the file

Rebalancer
• Goal: the percentage of disk occupied should be similar across DataNodes
– Usually run when new DataNodes are added
– The cluster stays online while the Rebalancer is active
– The Rebalancer is throttled to avoid network congestion
– It is a command-line tool
Secondary NameNode
• Not a hot standby for the NameNode
• Connects to the NameNode every hour (the default checkpoint period; configurable via fs.checkpoint.period)
• Housekeeping, backup of NameNode metadata
• Saved metadata can be used to rebuild a failed NameNode
HDFS Architecture

[Diagram: the NameNode holds the metadata (name, replicas, …) and handles metadata ops from clients; clients read blocks directly from DataNodes and write through a pipeline of DataNodes spread across Rack 1 and Rack 2; the NameNode sends block ops to DataNodes and directs replication between them.]
Anatomy of a File Write

[Diagram: HDFS client, DistributedFileSystem, NameNode, and a pipeline of DataNodes.]
1. The client calls create on the DistributedFileSystem.
2. The DistributedFileSystem asks the NameNode to create the file.
3. The client writes data to the output stream.
4. Write packets flow down the pipeline of DataNodes.
5. Ack packets flow back up the pipeline.
6. The client closes the stream.
7. Complete is reported to the NameNode.
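From the client’s side, the same seven steps reduce to a few FileSystem calls (the pipeline and acks happen inside the stream). A minimal sketch, with an illustrative namenode URI and path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:8020/"); // illustrative namenode URI
    FileSystem fs = FileSystem.get(conf);                  // the DistributedFileSystem
    // Steps 1-2: create() asks the NameNode to create the file.
    FSDataOutputStream out = fs.create(new Path("/user/training/hello.txt"));
    // Steps 3-5: writes are packetized and pushed down the DataNode pipeline.
    out.writeUTF("hello, hdfs");
    // Steps 6-7: close flushes the remaining packets and completes the file.
    out.close();
    fs.close();
  }
}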
Anatomy of a File Read

[Diagram: HDFS client (in the client JVM on the client node), DistributedFileSystem, FSDataInputStream, NameNode, and DataNodes.]
1. The client calls open on the DistributedFileSystem.
2. The DistributedFileSystem gets the block locations from the NameNode.
3. The client reads from the FSDataInputStream.
4.–5. The stream reads each block in turn from the nearest DataNode holding a replica.
6. The client closes the stream.
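The matching client-side sketch for a read (same illustrative URI and path as the write example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:8020/"); // illustrative namenode URI
    FileSystem fs = FileSystem.get(conf);                  // the DistributedFileSystem
    // Steps 1-2: open() fetches the block locations from the NameNode.
    FSDataInputStream in = fs.open(new Path("/user/training/hello.txt"));
    // Steps 3-5: reads are served from the nearest DataNode for each block.
    System.out.println(in.readUTF());
    // Step 6: close the stream.
    in.close();
    fs.close();
  }
}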
Replication and Rack Awareness
Hadoop Cluster Architecture (Contd.)

[Diagram: a client talks to the HDFS NameNode and to the MapReduce JobTracker; each worker node in the cluster runs a DataNode paired with a TaskTracker.]
Hadoop Cluster: A Typical Use Case

Name Node
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 × 10 GB/s
OS: 64-bit CentOS
Power: Redundant power supply

Secondary Name Node
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 cores
Ethernet: 3 × 10 GB/s
OS: 64-bit CentOS
Power: Redundant power supply

Data Node (per worker)
RAM: 16 GB
Hard disk: 6 × 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 × 10 GB/s
OS: 64-bit CentOS
Hadoop Cluster: Facebook
http://wiki.apache.org/hadoop/PoweredBy (this and other production deployments)
Hadoop Cluster Modes

Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
• No daemons; everything runs in a single JVM.
• Suitable for running MapReduce programs during development.
• Has no DFS.
Pseudo-Distributed Mode
• Hadoop daemons run on the local machine.
Fully-Distributed Mode
• Hadoop daemons run on a cluster of machines.
Hadoop 1.x: Core Configuration Files
• Core: core-site.xml
• HDFS: hdfs-site.xml
• MapReduce: mapred-site.xml
Hadoop Configuration Files

hadoop-env.sh: Environment variables that are used in the scripts to run Hadoop.
core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml: Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml: Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
masters: A list of machines (one per line) that each run a secondary namenode.
slaves: A list of machines (one per line) that each run a datanode and a tasktracker.
core-site.xml and hdfs-site.xml

hdfs-site.xml:

<?xml version="1.0"?>
<!--hdfs-site.xml-->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml:

<?xml version="1.0"?>
<!--core-site.xml-->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>
Defining HDFS Details in hdfs-site.xml

dfs.data.dir
Value: /disk1/hdfs/data, /disk2/hdfs/data
Default: ${hadoop.tmp.dir}/dfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary, /disk2/hdfs/namesecondary
Default: ${hadoop.tmp.dir}/dfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
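To see how these settings are consumed, here is a minimal sketch using Hadoop’s Configuration class; the resource path is illustrative, and files already on the classpath load automatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowJobTracker {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // core-site.xml etc. on the classpath are loaded automatically;
    // additional files can be added explicitly (illustrative path).
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
    // Falls back to "local" (in-process jobtracker) when the property is unset.
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
  }
}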
Defining mapred-site.xml

mapred.job.tracker
Value: localhost:8021
Description: The hostname and the port that the jobtracker RPC server runs on. If set to the default value of local, then the jobtracker runs in-process on demand when you run a MapReduce job.

mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory relative to fs.default.name where shared files are stored during a job run.

mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.
All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
Slaves and Masters

Two files are used by the startup and shutdown commands:

Masters
• Contains a list of hosts, one per line, that are to host Secondary NameNode servers.

Slaves
• Contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers.
Per-Process Runtime Environment: hadoop-env.sh

• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
• It also offers a way to provide custom parameters for each of the servers.
• Set the JAVA_HOME parameter here so the scripts can locate the JVM.
• Examples of environment variables that you can specify:

export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"
Web UI URLs
• NameNode status: http://localhost:50070/dfshealth.jsp
• JobTracker status: http://localhost:50030/jobtracker.jsp
• TaskTracker status: http://localhost:50060/tasktracker.jsp
• DataBlock Scanner Report: http://localhost:50075/blockScannerReport
Practice HDFS Commands
Open a terminal window to the current working directory.
# /home/notroot
-----------------------------------------------------------------
# 1. Print the Hadoop version
hadoop version
-----------------------------------------------------------------
# 2. List the contents of the root directory in HDFS
hadoop fs -ls /
-----------------------------------------------------------------
# 3. Report the amount of space used and available on currently mounted filesystem
hadoop fs -df hdfs:/
-----------------------------------------------------------------
# 4. Count the number of directories, files and bytes under the paths that match the
# specified file pattern
hadoop fs -count hdfs:/
-----------------------------------------------------------------
# 5. Run a DFS filesystem checking utility
hadoop fsck /
-----------------------------------------------------------------
# 6. Run a cluster balancing utility
hadoop balancer
-----------------------------------------------------------------
# 7. Create a new directory named “hadoop” below the /user/training directory in HDFS.
hadoop fs -mkdir /user/training/hadoop
-----------------------------------------------------------------
# 8. Add a sample text file from the local directory named “data” to the new directory you created in HDFS
hadoop fs -put data/sample.txt /user/training/hadoop
-----------------------------------------------------------------
# 9. List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
-----------------------------------------------------------------
# 10. Add the entire local directory called “retail” to the /user/training directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop
-----------------------------------------------------------------
# 11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory.
# The next command will therefore list your home directory, and should show the items you’ve just added there.
hadoop fs -ls
-----------------------------------------------------------------
# 12. See how much space this directory occupies in HDFS.
hadoop fs -du -s -h hadoop/retail
-----------------------------------------------------------------
# 13. Delete a file ‘customers’ from the “retail” directory.
hadoop fs -rm hadoop/retail/customers
-----------------------------------------------------------------
# 14. Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers
-----------------------------------------------------------------
# 15. Delete all files from the “retail” directory using a wildcard.
hadoop fs -rm hadoop/retail/*
-----------------------------------------------------------------
# 16. To empty the trash
hadoop fs -expunge
-----------------------------------------------------------------
# 17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rm -r hadoop/retail
-----------------------------------------------------------------
# 18. Add the purchases.txt file from the local directory named “/home/training/” to the hadoop directory you created in HDFS
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
-----------------------------------------------------------------
# 19. To view the contents of your text file purchases.txt which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt
-----------------------------------------------------------------
# 20. Move a directory from one location to other
hadoop fs -mv hadoop apache_hadoop
-----------------------------------------------------------------
# 21. Copy the purchases.txt file from the “hadoop” directory in HDFS to the “data” directory in your local filesystem
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data
# 26. Default names of owner and group are training,training
# Use ‘-chown’ to change owner name and group name simultaneously
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt
-----------------------------------------------------------------
# 27. Default name of group is training
# Use ‘-chgrp’ command to change group name
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt
-----------------------------------------------------------------
# 28. Default replication factor to a file is 3.
# Use ‘-setrep’ command to change replication factor of a file
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
-----------------------------------------------------------------
# 29. Copy a directory from one cluster (or location) to another
# Use the ‘distcp’ command to copy (note: distcp is its own subcommand, not an ‘fs’ option)
# Use the '-overwrite' option to overwrite existing files
# Use the '-update' option to synchronize both directories
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
-----------------------------------------------------------------
# 30. Command to make the name node leave safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave
-----------------------------------------------------------------
# 31. List all the hadoop file system shell commands
hadoop fs
-----------------------------------------------------------------
# 33. Last but not least, always ask for help!
hadoop fs -help
Thank You