- Objectives
- Contents:
• Big data
• Apache Hadoop
• Examples using Hadoop
- Demo
- Q&A
- References
Objectives
• Big data overview
• Apache Hadoop common architecture:
– Read/write a file in the Hadoop Distributed File System (HDFS)
– How Hadoop MapReduce tasks work
– Differences between Hadoop 1 and Hadoop 2
• Develop a MapReduce job using Hadoop
• Apply Hadoop in the real world
Big data – Information explosion
Big data – Definition
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
- Gartner
Big data – The 3Vs
• Volume:
– Google receives over 2 million search queries every minute
– Transactional and sensor data are stored every fraction of a second
• Variety:
– YouTube and Facebook generate video, audio, image and text data
– Over 200 million emails are sent every minute
• Velocity:
– Experiments at CERN generate colossal amounts of data
– Particles collide 600 million times per second
– Their data center processes about one petabyte of data every day
Big data – Challenges
• Difficulty in identifying the right data and determining how best to use it
• Struggling to find the right talent
• Data access and connectivity obstacles
• The data technology landscape is evolving extremely fast
• Finding new ways of collaborating across functions and businesses
• Security concerns
Big data – Landscape
Big data – Plays a part in firms’ revenue
Apache Hadoop – What?
• It is a software platform that:
– allows us to easily write and run data-related applications
– facilitates processing and manipulating massive amounts of data
– scales these processes out conveniently
Apache Hadoop – Brief history
Apache Hadoop – Characteristics
• Reliable shared storage (HDFS) and analysis system (MapReduce)
• Highly scalable
• Cost-effective, as it can work with commodity hardware
• Highly flexible: can process both structured and unstructured data
• Built-in fault tolerance
• Write once, read many times
• Optimized for large and very large data sets
Apache Hadoop – Design principles
• Moving computation is cheaper than moving data
• Hardware will fail, manage it
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model
Apache Hadoop – Core architecture (1)
Apache Hadoop – Core architecture (2)
Apache Hadoop – HDFS architecture
Apache Hadoop – HDFS architecture – Replication
Apache Hadoop – HDFS architecture – Secondary namenode
Apache Hadoop – HDFS – Read a file
Apache Hadoop – HDFS – Write a file (1)
Apache Hadoop – HDFS – Write a file (2)
How MapReduce pattern works
Apache Hadoop – Running jobs in Hadoop 1
Apache Hadoop – Running jobs in Hadoop 1 – How it works
Apache Hadoop – Running jobs in Hadoop 2
Apache Hadoop – Running jobs in Hadoop 2 – How it works
Apache Hadoop – Usage
• When to use Hadoop: it can be used in various scenarios, including:
– Analytics
– Search
– Data retention
– Log file processing
– Analysis of text, image, audio and video content
– Recommendation systems, such as those on e-commerce websites
• When not to use Hadoop:
– Low-latency or near real-time data access
– Processing a large number of small files
– Scenarios requiring multiple writers, arbitrary writes, or updates within existing files
Apache Hadoop – Ecosystem
Examples using Hadoop – A retail management system
Examples using Hadoop – SQL Server and Hadoop
Real-world applications/solutions using Hadoop – MS HDInsight
Real-world applications/solutions using Hadoop – Case studies
References
- http://hadoop.apache.org
- Hadoop in Action – Chuck Lam
- Hadoop: The Definitive Guide – Tom White
- http://www.bigdatanews.com/
- http://stackoverflow.com
- http://codeproject.com
- Hadoop 2 Fundamentals – LiveLessons
Editor's Notes
  1. This definition consists of three parts: Part One: 3Vs (Variety – Velocity – Volume) Part Two: Cost-Effective, Innovative Forms of Information Processing Part Three: Enhanced Insight and Decision Making
  2. Data scientists from some companies break big data into 4 Vs: Volume, Variety, Velocity, Veracity. The others add one more V to the characteristics of big data: Value.
  3. Information about Big data Ecosystem can be found at URL: http://hadoopilluminated.com/hadoop_illuminated/Bigdata_Ecosystem.html
  4. Here are a few highlights of the Hadoop architecture:
- Hadoop works in a master-worker / master-slave fashion.
- Hadoop has two core components: HDFS and MapReduce.
  - HDFS (Hadoop Distributed File System) offers highly reliable, distributed storage and ensures reliability, even on commodity hardware, by replicating the data across multiple nodes. Unlike a regular file system, when data is pushed to HDFS it is automatically split into multiple blocks (a configurable parameter) and stored/replicated across various datanodes. This ensures high availability and fault tolerance.
  - MapReduce offers an analysis system which can perform complex computations on large datasets. This component is responsible for performing all the computations; it works by breaking a large, complex computation into multiple tasks, assigning those to individual worker/slave nodes, and taking care of coordination and consolidation of the results.
- The master contains the Namenode and Job Tracker components. The Namenode holds information about all the other nodes in the Hadoop cluster, the files present in the cluster, the constituent blocks of files and their locations in the cluster, and other information useful for the operation of the cluster. The Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
- Each worker/slave contains the Task Tracker and Datanode components. The Task Tracker is responsible for running the task/computation assigned to it. The Datanode is responsible for holding the data.
- The computers in the cluster can be in any location; there is no dependency on the location of the physical server.
The differences between Hadoop 1 and Hadoop 2:
- Hadoop 1 is limited to 4,000 nodes per cluster; Hadoop 2 scales up to 10,000 nodes per cluster.
- Hadoop 1 has a JobTracker bottleneck; Hadoop 2 achieves efficient cluster utilization through YARN.
- Hadoop 1 only supports MapReduce jobs; Hadoop 2 supports more job types.
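A minimal sketch of how the block and replica layout tracked by the namenode can be inspected from a client through the HDFS Java API; the path /data/sample.txt is hypothetical and the cluster address is assumed to come from core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLayout {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the cluster's namenode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; replace with a real HDFS path.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // The namenode reports, for each block, which datanodes hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Block at offset " + b.getOffset()
                    + " (" + b.getLength() + " bytes) on "
                    + String.join(", ", b.getHosts()));
        }
        fs.close();
    }
}
```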
  5. Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing.
  6. Namenode: The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and does the following tasks:
- manages the file system namespace;
- regulates clients’ access to files;
- executes file system operations such as renaming, closing, and opening files and directories.
Datanode: The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system. Datanodes perform read-write operations on the file system, as per client request. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block: Generally the user data is stored in the files of HDFS. A file in the file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks; in other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
Hadoop commands reference: the syntax is hadoop fs -command, where command is one of:
1. ls <path> – Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
2. lsr <path> – Behaves like -ls, but recursively displays entries in all subdirectories of path.
3. du <path> – Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.
4. dus <path> – Like -du, but prints a summary of disk usage of all files/directories in the path.
5. mv <src> <dest> – Moves the file or directory indicated by src to dest, within HDFS.
6. cp <src> <dest> – Copies the file or directory identified by src to dest, within HDFS.
7. rm <path> – Removes the file or empty directory identified by path.
8. rmr <path> – Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
9. put <localSrc> <dest> – Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
10. copyFromLocal <localSrc> <dest> – Identical to -put.
11. moveFromLocal <localSrc> <dest> – Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
12. get [-crc] <src> <localDest> – Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
13. getmerge <src> <localDest> – Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
14. cat <filename> – Displays the contents of filename on stdout.
15. copyToLocal <src> <localDest> – Identical to -get.
16. moveToLocal <src> <localDest> – Works like -get, but deletes the HDFS copy on success.
17. mkdir <path> – Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).
18. setrep [-R] [-w] rep <path> – Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
19. touchz <path> – Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
20. test -[ezd] <path> – Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
21. stat [format] <path> – Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
22. tail [-f] <filename> – Shows the last 1 KB of the file on stdout.
23. chmod [-R] mode,mode,... <path>... – Changes the file permissions associated with one or more objects identified by path... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply an umask.
24. chown [-R] [owner][:[group]] <path>... – Sets the owning user and/or group for files or directories identified by path... Sets owner recursively if -R is specified.
25. chgrp [-R] group <path>... – Sets the owning group for files or directories identified by path... Sets group recursively if -R is specified.
26. help <cmd-name> – Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
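A minimal sketch of the programmatic counterparts to a few of these shell commands, using the org.apache.hadoop.fs.FileSystem Java API; the /demo paths and local.txt file are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -mkdir /demo
        fs.mkdirs(new Path("/demo"));

        // hadoop fs -put local.txt /demo/local.txt  (hypothetical local file)
        fs.copyFromLocalFile(new Path("local.txt"), new Path("/demo/local.txt"));

        // hadoop fs -ls /demo
        for (FileStatus s : fs.listStatus(new Path("/demo"))) {
            System.out.println(s.getPermission() + " " + s.getLen() + " " + s.getPath());
        }

        // hadoop fs -rmr /demo  (recursive delete)
        fs.delete(new Path("/demo"), true);

        fs.close();
    }
}
```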
  7. When a client is writing data to an HDFS file, its data is first written to a local file, as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus a DataNode can be receiving data from the previous one in the pipeline while at the same time forwarding data to the next one: the data is pipelined from one DataNode to the next.
Why is the default number of replicas 3? Hadoop is used in a clustered environment: each cluster has multiple racks, and each rack has multiple datanodes. To make HDFS fault tolerant in the cluster we need to consider the following failures:
- DataNode failure
- Rack failure
The chance of a whole-cluster failure is fairly low, so let us not consider it. In the above cases we need to make sure that:
- if one DataNode fails, the same data can be read from another DataNode;
- if an entire rack fails, the same data can be read from another rack.
So it is clear why the default replication factor is 3: no two replicas go to the same DataNode, and at least one replica goes to a different rack, which fulfills the fault-tolerance criteria above.
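A small sketch of how the replication factor discussed above can be queried and adjusted per file through the FileSystem API; the path /demo/important.log is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/important.log");   // hypothetical file

        // The default comes from dfs.replication (3 unless overridden).
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Ask the namenode to raise the target replication for this one file;
        // the extra replicas are created asynchronously by the cluster.
        fs.setReplication(file, (short) 5);

        fs.close();
    }
}
```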
  8. The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event it can replace the primary name-node in case of its failure. The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads current name-node image and edits log files, joins them into new image and uploads the new image back to the (primary and the only) name-node. So if the name-node fails and you can restart it on the same physical node then there is no need to shutdown data-nodes, just the name-node need to be restarted. If you cannot use the old node anymore you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before failure if available; or on the secondary name-node. The latter will be the latest checkpoint without subsequent edits logs, that is the most recent name space modifications may be missing there. You will also need to restart the whole cluster in this case.
  9. Step 1: The client opens the file by calling the open() method on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
Step 2: DistributedFileSystem calls the namenode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block; the client will interact with the respective datanodes to read the file. The namenode also provides a token to the client, which it shows to the datanode for authentication. DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first (closest) datanode for the first block in the file.
Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to the datanode, then finds the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
Step 6: Blocks are read in order, with DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
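In client code the whole sequence collapses into a few calls; a minimal sketch, assuming the cluster address comes from core-site.xml and using a hypothetical path /demo/input.txt:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // For hdfs:// URIs this is a DistributedFileSystem instance.
        FileSystem fs = FileSystem.get(new Configuration());

        // open() triggers the namenode RPC described in Step 2.
        try (FSDataInputStream in = fs.open(new Path("/demo/input.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // Each read pulls bytes from the closest datanode holding the current block.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```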
  10. Step 1: The client creates the file by calling create() on DistributedFileSystem.
Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it. The namenode performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.
Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for blocks to be minimally replicated before returning successfully.
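The client side of the write path is equally short; a minimal sketch with a hypothetical output path /demo/output.txt:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() performs the namenode RPC from Step 2 and returns the stream
        // whose packets are pipelined through the datanodes (Steps 3-5).
        try (FSDataOutputStream out = fs.create(new Path("/demo/output.txt"), true);
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            writer.write("hello hdfs");
            writer.newLine();
        } // close() flushes the remaining packets and waits for acks (Steps 6-7).
        fs.close();
    }
}
```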
  11. There are two procedures:
- map filters and sorts the data;
- reduce summarizes the data;
- reduce is not necessary; you can have a map-only process.
This facilitates scalability and parallelization. Each MapReduce job is processed on the datanodes:
- jobs are simple and nodes perform similar jobs;
- combined together, the operation can be powerful and even complex;
- it is therefore necessary to write MapReduce programs with great care.
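A minimal WordCount sketch using the org.apache.hadoop.mapreduce API illustrates the pattern: the mapper emits (word, 1) pairs and the reducer sums them. The class names are illustrative and the input/output paths are taken from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: emit (word, 1) for every word in the input split
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: sum the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a job would typically be submitted with hadoop jar wordcount.jar WordCount <input> <output> (jar name and paths hypothetical).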
  12. HBase - Google BigTable Inspired. Non-relational distributed database. Ramdom, real-time r/w operations in column-oriented very large tables (BDDB: Big Data Data Base). It’s the backing system for MR jobs outputs. It’s the Hadoop database. It’s for backing Hadoop MapReduce jobs with Apache HBase tables. Hive - Data Warehouse infrastructure developed by Facebook. Data summarization, query, and analysis. It’s provides SQL-like language (not SQL92 compliant): HiveQL. Pig - Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce. Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow. Zookeeper - It’s a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps most famous of those are Apache HBase, Storm, Kafka. ZooKeeper is an application library with two principal implementations of the APIs—Java and C—and a service component implemented in Java that runs on an ensemble of dedicated servers. Zookeeper is for building distributed systems, simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. Zookeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure. Mahout - Machine learning library and math library, on top of MapReduce. Sqoop - Sqoop works to transport data from RDBMS to HDFS. Flume - Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS Oozie - Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs Ambari – Monitoring & management of Hadoop clusters and nodes
  13. A9.com – Amazon: To build Amazon's product search indices; process millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes. Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search AOL : Used for a variety of things ranging from statistics generation to running advanced algorithms for doing behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk giving us a total of 37 TB HDFS capacity. Facebook: To store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning; 320 machine cluster with 2,560 cores and about 1.3 PB raw storage; FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log analysis, data mining and machine learning University of Nebraska Lincoln: one medium-sized Hadoop cluster (200TB) to store and serve physics data;