Apache Hadoop is an open-source framework that lets you process large data sets (a.k.a. Big Data) across clusters of machines using simple programming models. This TechTalk introduces real-life uses of Hadoop so you can better understand when to use it, describes its components, and walks through the first steps to set up a Hadoop cluster.
By Dina Abu Khader - System Administrator
YouTube video: http://www.youtube.com/watch?v=pSjP171i-gM
3. What is Big Data
Data with the four V properties (characteristics): Volume, Velocity, Variety, Value.
● Volume: the data is too big (petabytes/exabytes) and exceeds the capacity of
traditional RDBMSs.
● Velocity: data is generated at a high rate, e.g. 30 TB/day.
● Variety: structured and unstructured data: access logs, DB records,
NoSQL documents, images.
● Value: what you are trying to solve and what kind of information you want
to get: live recommendations, analytics, processing large amounts of data.
“For every two degrees the temperature goes up, check-ins at ice-cream
shops go up by 2%” - Andrew Hogue, Foursquare.
4. W Questions on Hadoop
● Why use Hadoop?
● When to use Hadoop?
● What is Hadoop?
● How to set it up?
5. Why/When to use Hadoop
● When you don't need your answers in real time.
● Storage trends: cost per gigabyte is high and datasets are big.
● Time/Skills: steep learning curve.
● Non-confidential data.
● When you are throwing away valuable data.
(Java developers with data science skills are in incredibly high demand)
6. What is Hadoop
Open source framework for storing and processing large sets of data in a
distributed environment.
Core of Hadoop:
● HDFS - Storage
● YARN - Cluster Resource Manager
● MapReduce - Processing Part
● Ecosystem - Applications
8. HDFS - Hadoop Distributed File System
Similar to existing distributed file systems, but it runs on
low-cost servers and is highly fault-tolerant.
Goals :
● To overcome hardware failures
● Large datasets, horizontally scalable
● Simple coherency model: a write-once, read-many access model for files
3 servers of 4 TB (RAID 0) = 12 TB raw storage; with Hadoop's replication factor of 3, usable capacity = 12 TB / 3 = 4 TB.
9. HDFS - continued
Hadoop splits files into small blocks which are distributed
among nodes.
HDFS has two node types: the NameNode (NN) and DataNodes (DN); a client-side sketch follows this list.
● NN: master of the system; tracks each file's metadata {filename, #replicas, block IDs}
● DN: stores the file contents as blocks.
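The NN/DN split is invisible to clients: you write through the FileSystem API and HDFS decides block placement and replication. A minimal Java sketch, assuming a reachable cluster; the namenode address and file path below are illustrative assumptions, not from the talk:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write: the NameNode chooses which DataNodes get each block
            // and its replicas; the client just streams bytes.
            Path file = new Path("/user/dina.khader/readme");
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hdfs\n".getBytes("UTF-8"));
            }

            // Read: block locations come from the NameNode; the data
            // itself is streamed from the DataNodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), "UTF-8"))) {
                System.out.println(in.readLine());
            }
        }
    }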
11. HDFS Features
● Rack awareness
● Minimal data motion
● Utilities
● Rollback
● StandBy-NN
● Highly operable
12. HDFS - continued
The Hadoop FS (FileSystem) shell commands are very similar to Linux commands (a Java equivalent is sketched after the options list), e.g.:
● hadoop fs -ls
● hadoop fs -cat /user/dina.khader/readme
Options:
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, get, ls,
lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text,
touchz
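The same operations are also reachable from Java; a small sketch, assuming the Hadoop client libraries are on the classpath, that invokes the FsShell class behind the command-line hadoop fs tool:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FsShell;
    import org.apache.hadoop.util.ToolRunner;

    public class LsExample {
        public static void main(String[] args) throws Exception {
            // FsShell backs the "hadoop fs" command, so this is
            // equivalent to running: hadoop fs -ls /
            int rc = ToolRunner.run(new Configuration(), new FsShell(),
                                    new String[] {"-ls", "/"});
            System.exit(rc);
        }
    }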
13. YARN
YARN was introduced in Hadoop 2.x.
Main components of YARN (a client-side sketch follows this list):
1. ResourceManager
2. NodeManager
3. JobHistoryServer
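As a rough illustration of talking to the ResourceManager, here is a hedged sketch using the YARN client API; it assumes a yarn-site.xml on the classpath that points at a running cluster:

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListApps {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            // Ask the ResourceManager for all applications it knows about.
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + " "
                        + app.getName() + " "
                        + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }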
16. Eco-System
MapReduce gives data seekers a lot of power and flexibility, but it also adds a
lot of complexity. Therefore, there is a set of tools that make this easier, such as:
● Hive: SQL-like interface to access data stored on HDFS (see the JDBC sketch after this list).
● Pig: scripting platform to process data.
● HBase: column-oriented NoSQL DB, well suited for sparse data.
Other Hadoop EcoSystem components:
● Zookeeper: Centralized service for maintaining configuration
information.
● Oozie: Workflow scheduler system to manage Hadoop jobs.
● Sqoop/Flume: Transferring data from RDBMS/other sources into Hadoop.
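To make the "SQL-like interface" point concrete, Hive can be queried over JDBC like any other database. A hedged sketch; the host, port, and the access_logs table are assumptions, not from the talk:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; host/port/table are assumptions.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT COUNT(*) FROM access_logs")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }

Behind the scenes Hive compiles the query into MapReduce jobs over the files on HDFS, which is why it suits batch analytics rather than low-latency lookups.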
18. Quiz
Assuming the following:
● We have configured 64 MB block size
● Replication factor 3
● Rack-awareness
● File size 224 MB
● 3 servers with 4TB RAID 0
Questions (copy the file to HDFS and explain):
1. How many blocks will be generated?
2. What is the size of these blocks?
3. What will happen if one node goes down?
1. ceil(224/64) = 4 blocks per copy; with replication factor 3, 4 × 3 = 12 block replicas.
2. Nine replicas are 64 MB and three are 32 MB (the last block holds the remaining 224 − 3 × 64 = 32 MB).
3. HDFS re-replicates the lost blocks to the nearest available server in the rack (rack awareness). The sketch below works the numbers out.
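The same arithmetic spelled out as a tiny sketch (plain Java; the numbers come from the quiz):

    public class BlockMath {
        public static void main(String[] args) {
            long fileMB = 224, blockMB = 64;
            int replication = 3;

            long fullBlocks = fileMB / blockMB;              // 3 blocks of 64 MB
            long tailMB = fileMB % blockMB;                  // last block: 32 MB
            long blocks = fullBlocks + (tailMB > 0 ? 1 : 0); // 4 blocks
            long replicas = blocks * replication;            // 12 block replicas

            System.out.println(blocks + " blocks, " + replicas
                    + " replicas; " + fullBlocks * replication + " of "
                    + blockMB + " MB and " + replication + " of "
                    + tailMB + " MB");
        }
    }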
When you want to make predictions based on large historical data.
Ideas: Hadoop is a framework to organize work on Big Data, a set of rules for working with it.
Philosophy: HDFS, MapReduce, move processing to the data, and the ecosystem.
Hadoop is more of a data warehousing system - so it needs a system like MapReduce to actually process the data.
http://hortonworks.com/hadoop/yarn/
Use cheap servers and don't worry.
It's designed to be robust, in that your Big Data applications will continue to run even when individual servers — or clusters — fail.
NN: master of the system; it maintains and manages the blocks that live on the DNs.
These features ensure that Hadoop clusters are highly functional and highly available:
Rack awareness takes a node's physical location into account when allocating storage and scheduling tasks.
Minimal data motion. MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the same rack and provides very high aggregate read/write bandwidth.
Utilities diagnose the health of the file system and can rebalance the data across nodes.
Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors
An upgrade of HDFS makes a copy of the previous version’s metadata and data. Doing an upgrade does not double the storage requirements of the cluster, as the datanodes use hard links to keep two references (for the current and previous version) to the same block of data. This design makes it straightforward to roll back to the previous version of the filesystem, should you need to. You should understand that any changes made to the data on the upgraded system will be lost after the rollback completes.
You can keep only the previous version of the filesystem: you can’t roll back several versions. Therefore, to carry out another upgrade to HDFS data and metadata, you will need to delete the previous version, a process called finalizing the upgrade. Once an upgrade is finalized, there is no procedure for rolling back to a previous version.
Standby NameNode provides redundancy and supports high availability
Highly operable. Hadoop handles different types of failures that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.
Check HDFS:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.5.0/bk_getting-started-guide/content/ch_hdp2_getting_started_chp2_1.html
YARN splits the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster.
Moving data through the network is slow and very expensive in bandwidth and I/O.
Typically 10-100 maps run per node.
Map: a function that converts a list of items into another list of items.
Reduce: a function that "collects" the items in lists and performs some computation on all of them, reducing them to a single value.
MapReduce moves code (JARs) to the nodes that have the required data and processes it in parallel; the classic WordCount example is sketched below.
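The canonical illustration of Map and Reduce is word count. A condensed sketch, close to the example shipped with Hadoop's documentation; input and output paths are supplied on the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: turn each input line into (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word into a single value.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run it with something like hadoop jar wordcount.jar WordCount /input /output (the JAR name and paths are illustrative); the JAR is shipped to the nodes holding the input blocks.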
First of all, shuffling is the process of transferring data from the mappers to the reducers; it is necessary because otherwise the reducers would have no input (or no input from every mapper).
Sorting saves time for the reducer by making it easy to tell when a new reduce task should start: a new reduce task begins as soon as the next key in the sorted input differs from the previous one.
Partitioning is a different process: it determines to which reducer each (key, value) pair output by the map phase will be sent (the default rule is sketched below).
Note that a reducer is different from a reduce task: a reducer can run multiple reduce tasks. Note also that shuffling and sorting are performed locally by each reducer for its own input data, whereas partitioning is not local.
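For reference, Hadoop's default partitioning rule (HashPartitioner) boils down to a hash of the key modulo the number of reduce tasks; a sketch of an equivalent custom Partitioner:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Equivalent in spirit to Hadoop's default HashPartitioner: every
    // occurrence of the same key is routed to the same reducer.
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so the result is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }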
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
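A minimal sketch of the ZooKeeper Java client; the ensemble address zk1:2181 and the znode path are assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            // Connect with a 10 s session timeout; no watcher for brevity.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, null);
            // Store a piece of configuration under a znode.
            zk.create("/demo-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // Any client connected to the ensemble can now read it back.
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }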