Apache Hadoop is an open-source framework that lets you process large data sets (a.k.a. Big Data) across clusters of machines using simple programming models. This TechTalk introduces real-life uses of Hadoop so you can better understand when to use it, describes its components, and walks through the first steps to set up a Hadoop cluster.
By Dina Abu Khader - System Administrator
YouTube video: http://www.youtube.com/watch?v=pSjP171i-gM
3. What is Big Data
Data with the four V properties (characteristics): Volume, Velocity, Variety, Value.
● Volume: the data is too big (petabytes/exabytes) and exceeds the capacity of
traditional RDBMSs.
● Velocity: data is generated at a high rate, e.g. 30 TB/day.
● Variety: structured and unstructured data: access logs, DB records,
NoSQL documents, images.
● Value: what you are trying to solve and what kind of information you want
to get: live recommendations, analytics, processing large amounts of data.
“For every two degrees the temperature goes up, check-ins at ice-cream
shops go up by 2%” - Andrew Hogue, Foursquare.
4. W Questions on Hadoop
● Why use Hadoop?
● When to use Hadoop?
● What is Hadoop?
● How to set it up?
5. Why/When to use Hadoop
● When you don't need your answers in real time.
● Storage trends: cost per gigabyte is high and datasets are big.
● Time/Skills: steep learning curve.
● Non-confidential data.
● When you are throwing away valuable data.
(Java developers with data science skills are in incredibly high demand)
6. What is Hadoop
Open source framework for storing and processing large sets of data in a
distributed environment.
Core of Hadoop:
● HDFS - Storage
● YARN - Cluster Resource Manager
● MapReduce - Processing Part
● Ecosystem - Applications
8. HDFS - Hadoop Distributed File System
Similar to existing distributed file systems, but it runs on
low-cost servers and is highly fault-tolerant.
Goals :
● To overcome hardware failures
● Large datasets, horizontally scalable
● Simple coherency model: a write-once, read-many access model for files
3 servers of 4 TB (RAID 0) = 12 TB raw storage; with Hadoop's replication factor of 3, usable capacity = 12 TB / 3 = 4 TB.
9. HDFS - continued
Hadoop splits files into small blocks which are distributed
among nodes.
HDFS has two node types: the NameNode (NN) and DataNodes (DN); a client-side sketch follows this list.
● NN: master of the system; tracks each file's metadata {filename, #replicas, block IDs}
● DN: stores the file contents as blocks.
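The NN/DN split is invisible to clients: you write through the FileSystem API and HDFS decides block placement and replication. A minimal Java sketch, assuming a reachable cluster; the namenode address and file path below are illustrative assumptions, not from the talk:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write: the NameNode chooses which DataNodes get each block
            // and its replicas; the client just streams bytes.
            Path file = new Path("/user/dina.khader/readme");
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hdfs\n".getBytes("UTF-8"));
            }

            // Read: block locations come from the NameNode; the data
            // itself is streamed from the DataNodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), "UTF-8"))) {
                System.out.println(in.readLine());
            }
        }
    }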
11. HDFS Features
● Rack awareness
● Minimal data motion
● Utilities
● Rollback
● StandBy-NN
● Highly operable
12. HDFS - continued
The Hadoop FS (FileSystem) shell commands are very similar to Linux commands (a Java equivalent is sketched after the options list), e.g.:
● hadoop fs -ls
● hadoop fs -cat /user/dina.khader/readme
Options:
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, get, ls,
lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text,
touchz
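The same operations are also reachable from Java; a small sketch, assuming the Hadoop client libraries are on the classpath, that invokes the FsShell class behind the command-line hadoop fs tool:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FsShell;
    import org.apache.hadoop.util.ToolRunner;

    public class LsExample {
        public static void main(String[] args) throws Exception {
            // FsShell backs the "hadoop fs" command, so this is
            // equivalent to running: hadoop fs -ls /
            int rc = ToolRunner.run(new Configuration(), new FsShell(),
                                    new String[] {"-ls", "/"});
            System.exit(rc);
        }
    }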
13. YARN
YARN was introduced in Hadoop 2.x.
Main components of YARN (a client-side sketch follows this list):
1. ResourceManager
2. NodeManager
3. JobHistoryServer
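As a rough illustration of talking to the ResourceManager, here is a hedged sketch using the YARN client API; it assumes a yarn-site.xml on the classpath that points at a running cluster:

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListApps {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            // Ask the ResourceManager for all applications it knows about.
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + " "
                        + app.getName() + " "
                        + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }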
16. Eco-System
MapReduce gives data seekers a lot of power and flexibility, but it also adds a
lot of complexity. Therefore, there is a set of tools that make this easier, such as:
● Hive: SQL-like interface to access data stored on HDFS (see the JDBC sketch after this list).
● Pig: scripting platform to process data.
● HBase: column-oriented NoSQL DB, well suited for sparse data.
Other Hadoop EcoSystem components:
● Zookeeper: Centralized service for maintaining configuration
information.
● Oozie: Workflow scheduler system to manage Hadoop jobs.
● Sqoop/Flume: Transferring data from RDBMS/other sources into Hadoop.
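To make the "SQL-like interface" point concrete, Hive can be queried over JDBC like any other database. A hedged sketch; the host, port, and the access_logs table are assumptions, not from the talk:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; host/port/table are assumptions.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT COUNT(*) FROM access_logs")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }

Behind the scenes Hive compiles the query into MapReduce jobs over the files on HDFS, which is why it suits batch analytics rather than low-latency lookups.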
18. Quiz
Assuming the following:
● We have configured 64 MB block size
● Replication factor 3
● Rack-awareness
● File size 224 MB
● 3 servers with 4TB RAID 0
Questions (copy the file to HDFS and explain):
1. How many blocks will be generated?
2. What is the size of these blocks?
3. What will happen if one node goes down?
1. ceil(224/64) = 4 blocks per copy; with replication factor 3, 4 × 3 = 12 block replicas.
2. Nine replicas are 64 MB and three are 32 MB (the last block holds the remaining 224 − 3 × 64 = 32 MB).
3. HDFS re-replicates the lost blocks to the nearest available server in the rack (rack awareness). The sketch below works the numbers out.
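The same arithmetic spelled out as a tiny sketch (plain Java; the numbers come from the quiz):

    public class BlockMath {
        public static void main(String[] args) {
            long fileMB = 224, blockMB = 64;
            int replication = 3;

            long fullBlocks = fileMB / blockMB;              // 3 blocks of 64 MB
            long tailMB = fileMB % blockMB;                  // last block: 32 MB
            long blocks = fullBlocks + (tailMB > 0 ? 1 : 0); // 4 blocks
            long replicas = blocks * replication;            // 12 block replicas

            System.out.println(blocks + " blocks, " + replicas
                    + " replicas; " + fullBlocks * replication + " of "
                    + blockMB + " MB and " + replication + " of "
                    + tailMB + " MB");
        }
    }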
When you want to make predictions based on large historical data.
Ideas: Hadoop is a framework to organize work on Big Data, a set of rules for working with it.
Philosophy: HDFS, MapReduce, move processing to the data, and the ecosystem.
Hadoop is more of a data warehousing system - so it needs a system like MapReduce to actually process the data.
http://hortonworks.com/hadoop/yarn/
Use cheap servers and don't worry.
It's designed to be robust, in that your Big Data applications will continue to run even when individual servers — or clusters — fail.
NN: master of the system; it maintains and manages the blocks that live on the DNs.
These features ensure that Hadoop clusters are highly functional and highly available:
Rack awareness takes a node's physical location into account when allocating storage and scheduling tasks.
Minimal data motion. MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the same rack and provides very high aggregate read/write bandwidth.
Utilities diagnose the health of the file system and can rebalance the data across nodes.
Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors
An upgrade of HDFS makes a copy of the previous version’s metadata and data. Doing an upgrade does not double the storage requirements of the cluster, as the datanodes use hard links to keep two references (for the current and previous version) to the same block of data. This design makes it straightforward to roll back to the previous version of the filesystem, should you need to. You should understand that any changes made to the data on the upgraded system will be lost after the rollback completes.
You can keep only the previous version of the filesystem: you can’t roll back several versions. Therefore, to carry out another upgrade to HDFS data and metadata, you will need to delete the previous version, a process called finalizing the upgrade. Once an upgrade is finalized, there is no procedure for rolling back to a previous version.
Standby NameNode provides redundancy and supports high availability
Highly operable. Hadoop handles different types of failures that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.
Check HDFS:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.5.0/bk_getting-started-guide/content/ch_hdp2_getting_started_chp2_1.html
YARN splits the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster.
Moving data through the network is slow and very expensive in bandwidth and I/O.
Typically 10-100 maps run per node.
Map: a function that converts a list of items into another list of items.
Reduce: a function that "collects" the items in lists and performs some computation on all of them, reducing them to a single value.
MapReduce moves code (JARs) to the nodes that have the required data and processes it in parallel; the classic WordCount example is sketched below.
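The canonical illustration of Map and Reduce is word count. A condensed sketch, close to the example shipped with Hadoop's documentation; input and output paths are supplied on the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: turn each input line into (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word into a single value.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run it with something like hadoop jar wordcount.jar WordCount /input /output (the JAR name and paths are illustrative); the JAR is shipped to the nodes holding the input blocks.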
First of all, shuffling is the process of transferring data from the mappers to the reducers; it is necessary because otherwise the reducers would have no input (or no input from every mapper).
Sorting saves time for the reducer by making it easy to tell when a new reduce task should start: a new reduce task begins as soon as the next key in the sorted input differs from the previous one.
Partitioning is a different process: it determines to which reducer each (key, value) pair output by the map phase will be sent (the default rule is sketched below).
Note that a reducer is different from a reduce task: a reducer can run multiple reduce tasks. Note also that shuffling and sorting are performed locally by each reducer for its own input data, whereas partitioning is not local.
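For reference, Hadoop's default partitioning rule (HashPartitioner) boils down to a hash of the key modulo the number of reduce tasks; a sketch of an equivalent custom Partitioner:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Equivalent in spirit to Hadoop's default HashPartitioner: every
    // occurrence of the same key is routed to the same reducer.
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so the result is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }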
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
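A minimal sketch of the ZooKeeper Java client; the ensemble address zk1:2181 and the znode path are assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            // Connect with a 10 s session timeout; no watcher for brevity.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, null);
            // Store a piece of configuration under a znode.
            zk.create("/demo-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // Any client connected to the ensemble can now read it back.
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }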