First Step for Big Data
with Apache Hadoop
By AJ. Surath Kasembunsiri
NCS L.3, CompTIA Security+, ITIL Foundation v.3 , CCNA
MCT, MCITP Enterprise, MCSE +Security +Messaging
Instructor: AJ. Surath Kasembunsiri
Email: surath@born2learn.net
More than 15 years of IT experience, with 10+ years of enterprise IT consulting for companies such as EGAT, Trade Siam, and others.
Instructor for many organizations, including NSTDA Academy, Software Park, Microsoft, EGAT, ACIS, GSB, Kasikorn Bank, and more.
Agenda
Overview of Big Data
The Apache Hadoop ecosystem
Hadoop basics
How to install and configure Apache Hadoop
Overview of Big Data
What’s Big Data?
There is no single definition; here is Wikipedia’s:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, storage, search, sharing, transfer, analysis, and visualization.
Big Data Everywhere
Lots of data is being collected and warehoused:
Telecom (Call Detail Records)
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
Health care
Big Data Analytics
Business without analytics
source: http://workpointtv.com/news/4756
A New Set of Questions
How do I optimize my fleet based on weather and traffic patterns?
What’s the social sentiment for my brand or products?
How do I better predict future outcomes?
Next industrial revolution (1)
Next industrial revolution (2)
Big Data: 3V’s
Volume (Scale)
Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB
Data volume is increasing exponentially as collected and generated data grows
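The 44x figure follows directly from the two endpoints on the slide; as a quick arithmetic check (a throwaway sketch, nothing Hadoop-specific assumed):

```shell
# ratio of projected 2020 volume (35 ZB) to 2009 volume (0.8 ZB)
awk 'BEGIN { printf "%.0fx\n", 35 / 0.8 }'
```

35 / 0.8 = 43.75, which rounds to the 44x quoted on the slide.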
About Data Source
source: www-03.ibm.com/press/us/en/photo/39145.wss
Velocity (Speed)
Data is being generated fast and needs to be processed fast.
Some Make it 4V’s
Harnessing Big Data
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (data warehousing)
RTAP: Real-Time Analytics Processing (Big Data architecture & technology)
How to Implement Big Data
Define the business question
Define the data sources
Decide whether the technology requires Big Data or not
Database Technology (1)
Database Technology (2)
DATA STORAGE | DATA PROCESSING
Database Technology (3)
The Apache Hadoop Ecosystem
What is Hadoop?
Hadoop is a platform for data storage and processing that is:
Scalable
Fault tolerant
Open source (Apache license)
Written in Java

CORE HADOOP COMPONENTS
Hadoop Distributed File System (HDFS): file sharing & data protection across physical servers
MapReduce: distributed computing across physical servers

Flexibility: a single repository for storing, processing & analyzing any type of data; not bound by a single schema
Scalability: scale-out architecture divides workloads across multiple nodes; a flexible file system eliminates ETL bottlenecks
Low cost: can be deployed on commodity hardware; the open source platform guards against vendor lock-in
Hadoop is not for all types of work
Not good for processing transactions
Not good when work cannot be parallelized
Not good for low-latency data access
Not good for processing lots of small files
Not good for computation-intensive jobs with little data
Source: www.bigdatauniversity.com
The Need for Hadoop
Store and use all types of data
Process ALL the data, not just a sample
Scalability to many nodes
Commodity hardware
What Makes Hadoop Different?
Ability to scale out to petabytes in size using commodity hardware
Processing (MapReduce) jobs are sent to the data rather than shipping the data to be processed
Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured, and unstructured data
Manages fault tolerance and data replication automatically
Relational Database vs. Hadoop (1)
Relational                           | Hadoop
Schema required on write             | Schema required on read
Reads are fast                       | Writes are fast
Standards and structured governance  | Loosely structured
Limited or no data processing        | Processing coupled with data
Structured data types                | Multi- and unstructured data
Best-fit use:                        | Best-fit use:
Interactive OLAP analytics           | Historical/archive data
Complex ACID transactions            | Processing unstructured data
Operational data store               | Massive storage/processing
Relational Database vs. Hadoop (2)
History
Originally built as infrastructure for the “Nutch” project
Based on Google’s MapReduce and Google File System papers
Created by Doug Cutting in 2005 and developed at scale at Yahoo
Named after his son’s toy yellow elephant
Hadoop Timeline
Apache Hadoop Ecosystem (1)
Apache Hadoop Ecosystem (2)
HDFS: primary distributed file system
HBase: column-oriented database scaling to billions of rows
HCatalog: table and storage management layer that lets Hadoop applications (Pig, MapReduce, Hive) share data
Hive: data warehouse with SQL-like access
Pig: high-level language for expressing data analysis programs
Mahout: machine learning library
Sqoop: imports data from relational databases
Flume: collection and import of log and event data
Oozie: Hadoop workflow scheduler
ZooKeeper: centralized service for maintaining configuration
Ambari: cluster management
18. 35By AJ.Surath Kasembunsiri Born2Learn
Hadoop 1.0 vs Hadoop 2.0
Hadoop 2.0 Core Components
HDFS: a scalable and fault-tolerant distributed file system for storing data in any form
Yet Another Resource Negotiator (YARN): the cluster management layer that handles various workloads on the cluster
MapReduce: a framework that allows parallel processing of data in Hadoop
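The map → shuffle → reduce flow of MapReduce can be mimicked, purely as an illustration, by a local shell pipeline (no cluster involved): tr plays the mapper (emit one word per line), sort plays the shuffle (group identical keys together), and uniq -c plays the reducer (aggregate a count per key).

```shell
# map: one word per line; shuffle: sort identical words together;
# reduce: count occurrences of each word
echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c | sort -rn
```

On a real cluster the same word count runs as parallel map and reduce tasks over HDFS blocks; the pipeline only shows the data flow, not the distribution.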
Apache YARN
The resource manager for Hadoop 2.0.
Data processing engines run natively IN Hadoop:
BATCH: MapReduce | INTERACTIVE: Tez | STREAMING: Storm | GRAPH: Giraph | MICROSOFT: REEF | SAS: LASR, HPA | ONLINE: HBase | OTHERS
YARN: cluster resource management
HDFS: redundant, reliable storage
Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: doubles the processing in Hadoop on the same hardware while providing predictable performance & quality of service
Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Many Ways to Implement Hadoop (1)
Open source license
Distribution (software license)
Cloud services
Many Ways to Implement Hadoop (2)
Hadoop hardware appliances:
Cisco offers the Unified Computing System
Dell delivers a disk-intensive server (R720XD server)
HP "racks" up big data in a box (DL360p server)
IBM and Lenovo partner (x3650)
Oracle Big Data Appliance
And more...
Source: informationweek.com/big-data/hardware-architectures/10-hadoop-hardware-leaders
Data Warehouse vs. Data Lake
source: www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
ENTERPRISE HADOOP WITH DATA LAKE (1)
Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
ENTERPRISE HADOOP WITH DATA LAKE (2)
Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
How Does a Data Lake Work?
About Hadoop Cluster
Both storage and processing
- Via Master and Slave nodes
5 long-running daemons
- Storage
• NameNode (Master)
• Secondary NameNode (Master)
• DataNode (Slaves)
- Processing
• Resource Manager (Master)
• Node Manager (Slaves)
Hadoop Software Requirements
Linux
• Red Hat (CentOS), Debian (Ubuntu), SUSE
• Native packages & tarballs available from Cloudera & Hortonworks
Windows
• Windows Server
• Native packages & tarballs available from Hortonworks
Both Cloudera & Hortonworks package installs also install all supporting software and set up accounts
• JDK; mapred and hdfs Linux users; etc.
Hadoop Run Modes - Benefit
Hadoop Setup – Recommendations (1)
Use Cloudera
• Cloudera Manager
• Package installs
Use Linux
NameNode – holds all metadata for HDFS
Needs to be a highly reliable machine
RAID drives – typically RAID 10
Dual power supplies
Dual network cards – bonded
The more memory the better – typically 36 GB to 64 GB
Secondary NameNode – provides checkpointing for the NameNode; the same hardware as the NameNode should be used
Hadoop Setup – Recommendations (2)
DataNodes – hardware will depend on the specific needs of the cluster
No RAID needed; JBOD (just a bunch of disks) is used
Typical ratio per node:
1 hard drive
2 cores
4 GB of RAM
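The 1 drive : 2 cores : 4 GB ratio scales linearly with disk count; as a quick sketch for a hypothetical 12-disk DataNode (the disk count is only an example, not a recommendation from the slides):

```shell
# apply the per-drive rule of thumb: 2 cores and 4 GB RAM per hard drive
disks=12
echo "disks: $disks, cores: $((disks * 2)), RAM: $((disks * 4)) GB"
```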
About Cloudera
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is:
100% Apache open source
Contains all components needed for deployment
Fully documented and supported
Released on a reliable schedule

Fastest path to success: no need to write your own scripts or do integration testing on different components; works with a wide range of operating systems, hardware, databases, and data warehouses
Stable and reliable: extensive Cloudera QA systems, software & processes; tested & run in production at scale; proven in dozens of enterprise environments
Community driven: incorporates only main-line components from the Apache Hadoop ecosystem, with no forks or proprietary underpinnings
FREE
Requirements for Setting Up Cloudera (1)
You must have root or password-less sudo access to the hosts
If using root, the hosts must accept the same root password
The hosts must have Internet access to allow the wizard to install software from archive.cloudera.com
Cluster hosts must have a working network name resolution system and a correctly formatted /etc/hosts file
Requirements for Setting Up Cloudera (2)
You must have SSH access to the cluster hosts when you run the installation or upgrade wizard
No blocking by Security-Enhanced Linux (SELinux)
IPv6 must be disabled
No blocking by iptables or firewalls; port 7180 must be open because it is used to access Cloudera Manager
After installation, Cloudera Manager communicates using specific ports, which must be open
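The checks above can be scripted before launching the installer. The snippet below is a minimal preflight sketch; the tool choices (getenforce, getent, ss) are assumptions about a typical Linux host, not part of the Cloudera installer itself.

```shell
#!/bin/sh
# Hypothetical preflight checks for a Cloudera Manager host (sketch).
preflight() {
  # SELinux should not be Enforcing during installation
  if command -v getenforce >/dev/null 2>&1; then
    [ "$(getenforce)" = "Enforcing" ] && echo "WARN: SELinux is Enforcing"
  fi
  # the host's own name must resolve (via /etc/hosts or DNS)
  getent hosts "$(hostname)" >/dev/null 2>&1 || echo "WARN: hostname does not resolve"
  # port 7180 must be free for the Cloudera Manager Admin Console
  if command -v ss >/dev/null 2>&1; then
    ss -ltn 2>/dev/null | grep -q ':7180 ' && echo "WARN: port 7180 already in use"
  fi
  echo "preflight checks done"
}
preflight
```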
Hadoop Port Numbers
More port information:
http://www.cloudera.com/documentation/archive/manager/4-x/4-5-1/Configuring-Ports-for-Cloudera-Manager-Enterprise-Edition/cmeecp_topic_4.html
How to install and configure Apache Hadoop
Setup Cloudera (1)
1. Download the installer:
sudo wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
2. Give cloudera-manager-installer.bin executable permission:
sudo chmod u+x cloudera-manager-installer.bin
3. Run the Cloudera Manager Server installer:
sudo ./cloudera-manager-installer.bin
4. Follow the wizard until setup finishes
Ref: www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_qs_quick_start.html
Setup Cloudera (2)
5. Wait several minutes for the Cloudera Manager Server to complete its startup.
6. In a web browser, enter http://<server-host>:7180, where <server-host> is the host on which the Cloudera Manager Server is running. The login screen for the Cloudera Manager Admin Console displays.
7. Log into the Cloudera Manager Admin Console with the credentials Username: admin, Password: admin.
8. Follow the Cloudera Manager wizard instructions until setup finishes.
Ref: www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_qs_quick_start.html
Lab: Setup Cloudera Cluster on AWS
In this lab, you will see how to:
Set up a Cloudera cluster on AWS
First Test on Hadoop (1)
Use PuTTY to connect to the EC2 server and test the commands below.
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x - hbase hbase 0 2016-04-30 08:55 /hbase
drwxrwxr-x - solr solr 0 2016-04-30 08:55 /solr
drwxrwxrwt - hdfs supergroup 0 2016-04-30 08:59 /tmp
drwxr-xr-x - hdfs supergroup 0 2016-04-30 09:00 /user
ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -mkdir /user/ubuntu
ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -chown ubuntu /user/ubuntu
hdfs dfs -ls /: view the HDFS directory structure
hdfs dfs -mkdir: create the "ubuntu" directory using the privileges of the "hdfs" user
hdfs dfs -chown: grant ownership of the "ubuntu" directory to user "ubuntu"
First Test on Hadoop (2)
Use PuTTY to connect to the EC2 server and test the commands below.
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/
Found 7 items
drwxrwxrwx - mapred hadoop 0 2016-04-30 08:56 /user/history
drwxrwxr-t - hive hive 0 2016-04-30 08:59 /user/hive
drwxrwxr-x - hue hue 0 2016-04-30 09:00 /user/hue
drwxrwxr-x - impala impala 0 2016-04-30 09:00 /user/impala
drwxrwxr-x - oozie oozie 0 2016-04-30 09:01 /user/oozie
drwxr-x--x - spark spark 0 2016-04-30 08:57 /user/spark
drwxr-xr-x - ubuntu supergroup 0 2016-04-30 09:05 /user/ubuntu
ubuntu@ip-172-31-39-55:~$ hdfs dfs -mkdir input-data
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/ubuntu/
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-04-30 09:16 /user/ubuntu/input-data
List the contents of the "user" directory on HDFS; the directory created earlier appears, owned by "ubuntu"
Create an additional directory ("input-data")
List the contents of the "ubuntu" directory on HDFS; the newly created directory appears
Cloudera VM (1)
The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.
Requirements:
A 64-bit host OS
Virtualization software: VMware Player, KVM, or VirtualBox
The virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled
8 GB of total RAM with 2 vCPUs
The total system memory required varies depending on the size of your data set and on the other processes that are running
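On Linux you can also check for hardware virtualization support from a shell instead of rebooting into the BIOS; this sketch assumes a Linux host with /proc mounted (on other systems, use the vendor's own tools):

```shell
# count CPU flags advertising hardware virtualization:
# vmx = Intel VT-x, svm = AMD-V; 0 means no (visible) support
grep -Ec 'vmx|svm' /proc/cpuinfo 2>/dev/null || true
```

Note that a count of 0 can also mean the feature is present but disabled in the BIOS or hidden from a guest VM.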
Cloudera VM (2)
Step #1: Download & run VMware
Step #2: Download the Cloudera VM
Step #3: Extract it to the Cloudera folder
Step #4: Open "cloudera-quickstart-vm-xx-vmware"
Lab: Setup Cloudera VM
In this lab, you will see how to:
Set up the Cloudera VM
Open the Cloudera VM
About Cloudera Manager
The industry’s first end-to-end management application for Apache Hadoop: it manages the Apache Hadoop stack and automates the deployment and operation of Apache Hadoop
DISCOVER | DIAGNOSE | ACT | OPTIMIZE
HDFS | MAPREDUCE | HBASE | ZOOKEEPER | OOZIE | HUE
Cloudera Manager Interface (1)
Block Size & Replication Factor (RF) Configuration Setting
Cloudera Manager -> HDFS Services -> Configuration
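Outside Cloudera Manager, the same two settings live in hdfs-site.xml; the values below are only the common defaults (128 MB blocks, replication factor 3), shown as a sketch rather than values the deck prescribes:

```xml
<!-- hdfs-site.xml: HDFS block size and replication factor -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

In a Cloudera Manager-managed cluster these are normally edited through the Configuration page rather than by hand, since the manager regenerates config files on the hosts.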
About Hue (Hadoop User Experience)
A lightweight web server that lets you use Hadoop directly from your browser (an open source web interface)
Makes the Hadoop platform (HDFS, MapReduce, Hive, etc.) easy to use
Example: Hue Web Interface
Like Me if U Can
youtube.com/Born2Learn TH
facebook.com/Born2LearnTH
www.born2learn.net
surath@born2learn.net