By AJ.Surath Kasembunsiri Born2Learn
First Step for Big Data
with Apache Hadoop
By AJ. Surath Kasembunsiri
NCS L.3, CompTIA Security+, ITIL Foundation v.3, CCNA
MCT, MCITP Enterprise, MCSE +Security +Messaging
Speaker: AJ. Surath Kasembunsiri
Email: surath@born2learn.net
Portfolio:
 More than 15 years of IT experience, including 10+ years of enterprise consulting in the IT business area for many companies such as EGAT and Trade Siam.
 Instructor for many organizations, including NTSDA Academy, Software Park, Microsoft, EGAT, ACIS, GSB, Kasikorn Bank and more.
Agenda
 Overview of Big Data
 The Apache Hadoop ecosystem
 Hadoop Basics
 How to install and configure Apache Hadoop
Overview of Big Data
What’s Big Data?
 No single definition; here is the one from Wikipedia:
 Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
 The challenges include capture, storage, search, sharing, transfer,
analysis, and visualization.
Big Data Everywhere
 Lots of data is being collected
and warehoused
Telecom (Call Detail Record)
Web data, e-commerce
purchases at department/
grocery stores
Bank/Credit Card
transactions
Social Network
Health Care
Big Data Data Analytics
Business without analytics
source: http://workpointtv.com/news/4756
A New Set of Questions
How do I optimize my fleet based on weather and traffic patterns?
What’s the social sentiment for my brand or products?
How do I better predict future outcomes?
Next industrial revolution (1)
Next industrial revolution (2)
Big Data: 3V’s
Volume (Scale)
 Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
 Data volume is increasing exponentially
Exponential increase in
collected/generated data
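The volume figures above can be checked with a few lines of arithmetic. This is only a sketch: it uses the two figures from the slide (0.8 ZB in 2009, 35 ZB in 2020) and assumes steady exponential growth between them, which the slide does not state. Under that assumption the implied growth is roughly 41% per year.

```python
# Figures taken from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009

# Overall growth factor (the slide rounds this to 44x).
growth_factor = end_zb / start_zb

# Implied compound annual growth rate, ASSUMING steady exponential growth.
annual_rate = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```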
About Data Source
source: www-03.ibm.com/press/us/en/photo/39145.wss
Variety (Complexity)
Velocity (Speed)
 Data is being generated fast and needs to be processed fast.
Some Make it 4V’s
Harnessing Big Data
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture &
technology)
How to implement Big Data
 Define Business Question
 Define Data Source
 Define Technology -> need Big Data or not?
Database Technology (1)
Database Technology (2)
DATA STORAGE DATA PROCESS
Database Technology (3)
The Apache Hadoop Ecosystem
What is Hadoop?
Hadoop is a platform for data storage and processing that is…
 Scalable
 Fault tolerant
 Open source (Apache license)
 Written in Java
CORE HADOOP COMPONENTS
Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
MapReduce: Distributed Computing Across Physical Servers
Flexibility
 A single repository for storing, processing & analyzing any type of data
 Not bound by a single schema
Scalability
 Scale-out architecture divides workloads across multiple nodes
 Flexible file system eliminates ETL bottlenecks
Low Cost
 Can be deployed on commodity hardware
 Open source platform guards against vendor lock-in
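To make the MapReduce idea more concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using the classic word-count example. It runs in plain Python and does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions, not Hadoop APIs. On a real cluster, the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    sample = ["big data with hadoop", "hadoop stores big data"]
    print(word_count(sample))
```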
Hadoop is not for all types of work
 Not good for processing transactions
 Not good when work cannot be parallelized
 Not good for low-latency data access
 Not good for processing lots of small files
 Not good for intensive calculations on little data
Source www.bigdatauniversity.com
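One reason lots of small files hurt is NameNode memory, since the NameNode keeps the metadata for every file and block. The sketch below compares the same 10 TB stored as 1 MB files and as 1 GB files. The figure of about 150 bytes of heap per file or block object is a commonly quoted rule of thumb and is an assumption here, not a number from these slides; the 128 MB block size is also assumed.

```python
# ASSUMPTION (rule of thumb, not from the slides): roughly 150 bytes of
# NameNode heap per file object and per block object.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024**2  # assumed HDFS block size of 128 MB

def namenode_heap_bytes(num_files, file_size):
    """Estimate NameNode heap used by num_files files of file_size bytes each."""
    blocks_per_file = -(-file_size // BLOCK_SIZE)  # ceiling division
    objects = num_files * (1 + blocks_per_file)    # one file object plus its blocks
    return objects * BYTES_PER_OBJECT

total_data = 10 * 1024**4  # the same 10 TB stored two different ways
small = namenode_heap_bytes(total_data // 1024**2, 1024**2)  # 1 MB files
large = namenode_heap_bytes(total_data // 1024**3, 1024**3)  # 1 GB files

print(f"1 MB files: {small / 1024**2:.0f} MB of heap")
print(f"1 GB files: {large / 1024**2:.1f} MB of heap")
```

Under these assumptions, the small-file layout needs a few hundred times more NameNode memory for the same amount of data.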
The Need for Hadoop
 Store and use all types of data
 Process ALL the data; not just a sample
 Scalability to many nodes
 Commodity hardware
What Makes Hadoop Different?
 Ability to scale out to Petabytes in size using commodity
hardware
 Processing (MapReduce) jobs are sent to the data versus shipping
the data to be processed
 Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured, and unstructured data
 Manages fault tolerance and data replication automatically
Relational Database vs. Hadoop (1)
Aspect | Relational | Hadoop
Schema | Required on write | Required on read
Speed | Reads are fast | Writes are fast
Governance | Standards and structure | Loosely structured
Processing | Limited, no data processing | Processing coupled with data
Data types | Structured | Multi and unstructured
Best fit use | Interactive OLAP Analytics; Complex ACID Transactions; Operational Data Store | Historical/Archive Data; Processing unstructured data; Massive storage/processing
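The "schema required on write" versus "schema required on read" difference can be illustrated with a small sketch. This is plain Python, not Hadoop or a database API; the field names (`user_id`, `amount`) and both helper functions are hypothetical and chosen only for illustration.

```python
import json

# "Schema on write": the structure is enforced before a record is stored.
def write_with_schema(table, record):
    required = {"user_id": int, "amount": float}
    for field, field_type in required.items():
        if not isinstance(record.get(field), field_type):
            raise ValueError(f"rejected on write: bad or missing {field!r}")
    table.append(record)

# "Schema on read": raw lines are stored untouched and interpreted when queried.
def read_with_schema(raw_lines):
    for line in raw_lines:
        try:
            doc = json.loads(line)
            yield {"user_id": int(doc["user_id"]), "amount": float(doc["amount"])}
        except (ValueError, KeyError, TypeError):
            continue  # malformed lines are skipped at read time, not at load time

raw_store = ['{"user_id": 1, "amount": 9.5}', 'not json', '{"user_id": "2", "amount": "3"}']
rows = list(read_with_schema(raw_store))
print(rows)
```

Writing into the raw store never fails, which is why writes are fast; the cost of interpreting the data is paid on every read instead.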
Relational Database vs. Hadoop (2)
History
 Originally built as infrastructure for the “Nutch” project.
 Based on Google’s MapReduce and Google File System papers.
 Created by Doug Cutting and Mike Cafarella in 2005; Cutting joined Yahoo in 2006, where development continued.
 Named after Cutting’s son’s toy yellow elephant.
Hadoop Timeline
Apache Hadoop Ecosystem (1)
Apache Hadoop Ecosystem (2)
 HDFS: Primary distributed file system
 HBase: Column-oriented database scaling to billions of rows
 HCatalog: Table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to use the same tables
 Hive: Data warehouse with SQL-like access
 Pig: High-level language for expressing data analysis programs
 Mahout: Machine learning library
 Sqoop: Imports data from relational databases
 Flume: Collection and import of log and event data
 Oozie: Workflow scheduler for Hadoop jobs
 ZooKeeper: Centralized service for maintaining configuration
 Ambari: Cluster management
Hadoop 1.0 vs Hadoop 2.0
Hadoop 1.0 vs Hadoop 2.0
Hadoop 2.0 Core Components
 HDFS: A scalable and fault-tolerant distributed filesystem to store data in any form.
 Yet Another Resource Negotiator (YARN): the cluster
management layer to handle various workloads on the cluster.
 MapReduce: a framework that allows parallel processing of data
in Hadoop
Apache YARN
The resource manager for Hadoop 2.0
Data Processing Engines Run Natively IN Hadoop
BATCH: MapReduce
INTERACTIVE: Tez
STREAMING: Storm
GRAPH: Giraph
MICROSOFT: REEF
SAS: LASR, HPA
ONLINE: HBase
OTHERS
YARN: Cluster Resource Management
HDFS: Redundant, Reliable Storage
Flexible
Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient
Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared
Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Many Ways to Implement Hadoop (1)
Open Source License
Distribution
(Software License)
Cloud Services
Many Ways to Implement Hadoop (2)
Hadoop Hardware Appliance
 Cisco offers Unified Computing System
 Dell delivers a disk-intensive server (R720XD server)
 HP "racks" up big data in a box (DL360p server)
 IBM and Lenovo partner (x3650)
 Oracle Big Data Appliance
 And more..
Source: informationweek.com/big-data/hardware-architectures/10-hadoop-hardware-leaders
Compare Hadoop Distributions (1)
Compare Hadoop Distributions (2)
Data Warehouse vs. Data Lake
source: www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
ENTERPRISE HADOOP WITH DATA LAKE (1)
Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
ENTERPRISE HADOOP WITH DATA LAKE (2)
Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
How Does a Data Lake Work?
Hadoop Basics
Hadoop 2.0 - Architecture
About Hadoop Cluster
 Both Storage and Processing
- Via Master and Slave nodes
 5 long-running daemons
- Storage
• NameNode (Master)
• Secondary NameNode (Master)
• DataNode (Slaves)
- Processing
• Resource Manager (Master)
• Node Manager (Slaves)
Name Node
Data Node
Hadoop Software Requirement
 Linux
• Red Hat (CentOS), Debian (Ubuntu), SUSE
• Native packages & tarballs available from Cloudera & Hortonworks
 Windows
• Windows Server
• Native packages & tarballs available from Hortonworks
 Both Cloudera & Hortonworks package installs also install all supporting software & set up accounts
• JDK, mapred and hdfs Linux users, etc.
Hadoop Run Modes - Benefit
Hadoop Setup – Recommendations (1)
 Use Cloudera
• Cloudera Manager
• Package installs
 Use Linux
 NameNode – Holds all metadata for HDFS
Needs to be a highly reliable machine
RAID drives – typically RAID 10
Dual power supplies
Dual network cards – Bonded
The more memory the better – typically 36 GB to 64 GB
 Secondary NameNode – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used
Hadoop Setup – Recommendations (2)
 DataNodes – Hardware will depend on the specific needs of the
cluster
No RAID needed, JBOD (just a bunch of disks) is used
Typical ratio is:
1 hard drive
2 cores
4GB of RAM
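The typical ratio above (1 hard drive : 2 cores : 4 GB of RAM) can be turned into a quick sizing helper. This is only a sketch of the arithmetic; `datanode_spec` is a hypothetical helper, and the 12-drive example is an assumed chassis size, not a recommendation from the slides.

```python
# Ratio from the slide: for each hard drive, 2 cores and 4 GB of RAM.
CORES_PER_DRIVE = 2
RAM_GB_PER_DRIVE = 4

def datanode_spec(num_drives):
    """Return the cores and RAM that match a given number of JBOD drives."""
    return {
        "drives": num_drives,
        "cores": num_drives * CORES_PER_DRIVE,
        "ram_gb": num_drives * RAM_GB_PER_DRIVE,
    }

# Example: a DataNode with 12 drives (an assumed chassis size).
print(datanode_spec(12))
```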
About Cloudera
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
 100% Apache open source
 Contains all components needed for deployment
 Fully documented and supported
 Released on a reliable schedule
Fastest Path to Success
 No need to write your own scripts or do integration testing on different components
 Works with a wide range of operating systems, hardware, databases and data warehouses
Stable and Reliable
 Extensive Cloudera QA systems, software & processes
 Tested & run in production at scale
 Proven at scale in dozens of enterprise environments
Community Driven
 Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings
 FREE
Requirements for Setup Cloudera (1)
 You must have root or password-less sudo access to the hosts
 If using root, the hosts must accept the same root password
 The hosts must have Internet access to allow the wizard to install
software from archive.cloudera.com
 Cluster hosts must have a working network name resolution
system and correctly formatted /etc/hosts file
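As a rough aid for the last requirement, the sketch below checks that each entry in hosts-file text starts with a valid IP address followed by at least one hostname. It only checks the basic line format, which is an assumption about what "correctly formatted" means here; it does not test real name resolution. The helper `check_hosts_lines` and the sample host names are hypothetical.

```python
import ipaddress

def check_hosts_lines(lines):
    """Return a list of (line_number, problem) for malformed hosts entries."""
    problems = []
    for number, line in enumerate(lines, start=1):
        entry = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not entry:
            continue
        fields = entry.split()
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            problems.append((number, f"invalid IP address: {fields[0]}"))
            continue
        if len(fields) < 2:
            problems.append((number, "no hostname after the IP address"))
    return problems

# Hypothetical sample; on a real host, read the lines from /etc/hosts instead.
sample = [
    "127.0.0.1 localhost",
    "172.31.39.55 node1.example.internal node1  # hypothetical host",
    "172.31.39.300 node2.example.internal",
    "172.31.39.57",
]
print(check_hosts_lines(sample))
```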
Requirements for Setup Cloudera (2)
 Must have SSH access to the cluster hosts when you run the
installation or upgrade wizard
 No blocking is done by Security-Enhanced Linux (SELinux)
 IPv6 must be disabled.
 No blocking by iptables or firewalls; port 7180 must be open
because it is used to access Cloudera Manager
 After installation. Cloudera Manager communicates using specific
ports, which must be open.
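A quick way to confirm that a port such as 7180 is reachable is to attempt a TCP connection. The sketch below uses only Python's standard `socket` module; `port_is_open` is a hypothetical helper, and "localhost" is a placeholder for the Cloudera Manager Server host.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

if __name__ == "__main__":
    # "localhost" is a placeholder; use the Cloudera Manager Server host instead.
    print("7180 reachable:", port_is_open("localhost", 7180))
```

A result of False does not say whether a firewall, iptables, or a stopped service is the cause, only that the connection did not succeed.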
Hadoop Port Numbers
 More port information on
http://www.cloudera.com/documentation/archive/manager/4-x/4-5-1/Configuring-Ports-for-Cloudera-Manager-Enterprise-Edition/cmeecp_topic_4.html
How to install and
configure Apache Hadoop
Setup Cloudera (1)
1. Download the installer:
sudo wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
2. Give cloudera-manager-installer.bin executable permission.
sudo chmod u+x cloudera-manager-installer.bin
3. Run the Cloudera Manager Server installer.
sudo ./cloudera-manager-installer.bin
4. Follow the wizard until setup finishes.
Ref: www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_qs_quick_start.html
Setup Cloudera (2)
5. Wait several minutes for the Cloudera Manager Server to
complete its startup.
6. In a web browser, enter http://<Server host>:7180, where <Server host> is the host on which the Cloudera Manager Server is running. The login screen for Cloudera Manager Admin Console displays.
7. Log into Cloudera Manager Admin Console with the credentials: Username: admin Password: admin.
8. Follow the setup instructions in the Cloudera Manager wizard until it finishes.
Ref: www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_qs_quick_start.html
Lab: Setup Cloudera Cluster on AWS
In this lab, you will see how to:
 Setup Cloudera Cluster on AWS
Cloudera Cluster
First Test on Hadoop (1)
 Use PuTTY to connect to the EC2 server and run the commands below.
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x - hbase hbase 0 2016-04-30 08:55 /hbase
drwxrwxr-x - solr solr 0 2016-04-30 08:55 /solr
drwxrwxrwt - hdfs supergroup 0 2016-04-30 08:59 /tmp
drwxr-xr-x - hdfs supergroup 0 2016-04-30 09:00 /user
ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -mkdir /user/ubuntu
ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -chown ubuntu /user/ubuntu
View the directory structure of HDFS
Create the directory “ubuntu” using the privileges of “hdfs”
Modify the directory “ubuntu”, giving owner rights to user:ubuntu
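The two commands above (create a home directory as the `hdfs` superuser, then hand ownership to the user) can be generated for any user name. The sketch below only builds and prints the command lines; it does not run them. `home_dir_commands` is a hypothetical helper, and the `/user/<name>` layout follows the listing on this slide.

```python
import shlex

def home_dir_commands(user, superuser="hdfs"):
    """Build the mkdir and chown commands from the slide for a given user."""
    home = f"/user/{user}"
    return [
        ["sudo", "-u", superuser, "hdfs", "dfs", "-mkdir", home],
        ["sudo", "-u", superuser, "hdfs", "dfs", "-chown", user, home],
    ]

for argv in home_dir_commands("ubuntu"):
    # Printed rather than executed; on a real cluster, pass argv to subprocess.run.
    print(shlex.join(argv))
```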
First Test on Hadoop (2)
 Use PuTTY to connect to the EC2 server and run the commands below.
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/
Found 7 items
drwxrwxrwx - mapred hadoop 0 2016-04-30 08:56 /user/history
drwxrwxr-t - hive hive 0 2016-04-30 08:59 /user/hive
drwxrwxr-x - hue hue 0 2016-04-30 09:00 /user/hue
drwxrwxr-x - impala impala 0 2016-04-30 09:00 /user/impala
drwxrwxr-x - oozie oozie 0 2016-04-30 09:01 /user/oozie
drwxr-x--x - spark spark 0 2016-04-30 08:57 /user/spark
drwxr-xr-x - ubuntu supergroup 0 2016-04-30 09:05 /user/ubuntu
ubuntu@ip-172-31-39-55:~$ hdfs dfs -mkdir input-data
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/ubuntu/
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-04-30 09:16 /user/ubuntu/input-data
View the contents of the “user” directory on HDFS
The created directory is found (see its “owner”)
Create an additional directory
View the contents of the “ubuntu” directory on HDFS
The created directory is found
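When listings like the ones above need to be checked by a script, the text output can be split into fields. The sketch below parses the column layout shown on this slide (permissions, replication or "-", owner, group, size, date, time, path); `HdfsEntry` and `parse_ls` are hypothetical names, and the parser assumes paths without unusual whitespace.

```python
from typing import NamedTuple

class HdfsEntry(NamedTuple):
    permissions: str
    replication: str  # "-" for directories
    owner: str
    group: str
    size: int
    modified: str
    path: str

def parse_ls(output):
    """Parse the text printed by `hdfs dfs -ls` into HdfsEntry records."""
    entries = []
    for line in output.splitlines():
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue  # skips the "Found N items" header and blank lines
        perms, repl, owner, group, size, date, time, path = fields
        entries.append(
            HdfsEntry(perms, repl, owner, group, int(size), f"{date} {time}", path)
        )
    return entries

listing = """Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-04-30 09:16 /user/ubuntu/input-data"""
for entry in parse_ls(listing):
    print(entry.owner, entry.path)
```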
Cloudera VM (1)
 Cloudera VM contains a single-node Apache Hadoop cluster along
with everything you need to get started with Hadoop.
 Requirements:
 A 64-bit host OS
 Virtualization software: VMware Player, KVM, or VirtualBox.
 Virtualization software requires a laptop that supports virtualization. If you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.
 8 GB of total RAM with 2 vCPUs.
 The total system memory required varies depending on the size
of your data set and on the other processes that are running.
Cloudera VM (2)
 Step#1: Download & Run VMware
 Step#2: Download Cloudera VM
 Step#3: Extract to the Cloudera folder.
 Step#4: Open the "cloudera-quickstart-vm-xx-vmware"
Lab: Setup Cloudera VM
In this lab, you will see how to:
 Setup Cloudera VM
 Open Cloudera VM
Cloudera
(VM)
About Cloudera Manager
The industry’s first
for Apache Hadoop
the
Apache Hadoop stack
Automates the
of Apache Hadoop
DISCOVER DIAGNOSE OPTIMIZE ACT
HDFS MAPREDUCE HBASE
ZOOKEEPER OOZIE HUE
Cloudera Manager Interface (1)
Cloudera Manager Interface (2)
Cloudera Manager Interface (3)
Block Size & RF Config. Setting
 Cloudera Manager -> HDFS Services -> Configuration
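To see what the block size and replication factor (RF) settings mean in practice, the sketch below estimates how many blocks a file is split into and how much raw disk it consumes. The defaults used (128 MB blocks, RF 3) are the usual Hadoop 2 defaults and are assumptions here; check the values actually set in your cluster. `hdfs_footprint` is a hypothetical helper.

```python
import math

def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (blocks, block replicas, raw bytes on disk) for one file."""
    blocks = math.ceil(file_bytes / block_bytes)
    replicas = blocks * replication
    # The last block only occupies the space it needs, so raw usage is
    # simply the file size times the replication factor.
    raw_bytes = file_bytes * replication
    return blocks, replicas, raw_bytes

blocks, replicas, raw = hdfs_footprint(1024**3)  # a 1 GB file
print(blocks, replicas, raw / 1024**3)
```

With these assumed defaults, a 1 GB file becomes 8 blocks, stored as 24 block replicas, using 3 GB of raw disk across the cluster.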
About Hue (Hadoop User Experience)
 Lightweight Web server that lets you use Hadoop directly from your
browser. (Open source web interface)
 Makes the Hadoop platform (HDFS, MapReduce, Hive, etc.) easy to use
Example: Hue Web Interface
Like Me if U Can
youtube.com/Born2Learn TH
facebook.com/Born2LearnTH
www.born2learn.net
surath@born2learn.net

Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

First Step for Big Data with Apache Hadoop

  • 1. First Step for Big Data with Apache Hadoop
       By AJ. Surath Kasembunsiri, Born2Learn
       NCS L.3, CompTIA Security+, ITIL Foundation v.3, CCNA, MCT, MCITP Enterprise, MCSE +Security +Messaging
       Speaker: AJ. Surath Kasembunsiri
       Portfolio:
       - More than 15 years of IT experience, 10+ in enterprise consulting in the IT business area for many companies, such as EGAT and Trade Siam.
       - Instructor for many organizations, including NTSDA Academy, Software Park, Microsoft, EGAT, ACIS, GSB, and Kasikorn Bank.
       Email: surath@born2learn.net
  • 2. Agenda
       - Overview of Big Data
       - The Apache Hadoop ecosystem
       - Hadoop basics
       - How to install and configure Apache Hadoop
       Section: Overview of Big Data
  • 3. What's Big Data?
       - There is no single definition; here is the one from Wikipedia:
       - Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
       - The challenges include capture, storage, search, sharing, transfer, analysis, and visualization.
       Big Data Everywhere
       Lots of data is being collected and warehoused:
       - Telecom (call detail records)
       - Web data, e-commerce
       - Purchases at department/grocery stores
       - Bank/credit card transactions
       - Social networks
       - Health care
  • 4. Big Data Data Analytics
       Business without analytics
       Source: http://workpointtv.com/news/4756
  • 5. A New Set of Questions
       - How do I optimize my fleet based on weather and traffic patterns?
       - What's the social sentiment for my brand or products?
       - How do I better predict future outcomes?
       Next industrial revolution (1)
  • 6. Next industrial revolution (2)
       Big Data: 3V's
  • 7. Volume (Scale)
       - Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB
       - Data volume is increasing exponentially
       [Figure: exponential increase in collected/generated data]
       About Data Source
       Source: www-03.ibm.com/press/us/en/photo/39145.wss
  • 8. Variety (Complexity)
  • 9. Velocity (Speed)
       - Data is being generated fast and needs to be processed fast.
       Some Make It 4V's
  • 10. Harnessing Big Data
       - OLTP: Online Transaction Processing (DBMSs)
       - OLAP: Online Analytical Processing (data warehousing)
       - RTAP: Real-Time Analytics Processing (Big Data architecture and technology)
       How to implement Big Data
       - Define the business question
       - Define the data source
       - Define the technology: is Big Data needed or not?
  • 11. Database Technology (1)
       Database Technology (2)
       DATA STORAGE, DATA PROCESS
  • 12. Database Technology (3)
       Section: The Apache Hadoop ecosystem
  • 13. What is Hadoop?
       Hadoop is a platform for data storage and processing that is:
       - Scalable
       - Fault tolerant
       - Open source (Apache license)
       - Written in Java
       Core Hadoop components:
       - Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers
       - MapReduce: distributed computing across physical servers
       Flexibility:
       - A single repository for storing, processing, and analyzing any type of data
       - Not bound by a single schema
       Scalability:
       - Scale-out architecture divides workloads across multiple nodes
       - Flexible file system eliminates ETL bottlenecks
       Low cost:
       - Can be deployed on commodity hardware
       - Open source platform guards against vendor lock-in
       Hadoop is not for all types of work
       - Not good for processing transactions
       - Not good when work cannot be parallelized
       - Not good for low-latency data access
       - Not good for processing lots of small files
       - Not good for intensive calculation on little data
       Source: www.bigdatauniversity.com
  • 14. The Need for Hadoop
       - Store and use all types of data
       - Process ALL the data, not just a sample
       - Scalability to many nodes
       - Commodity hardware
       What Makes Hadoop Different?
       - Ability to scale out to petabytes in size using commodity hardware
       - Processing (MapReduce) jobs are sent to the data, rather than shipping the data to be processed
       - Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured, and unstructured data
       - Manages fault tolerance and data replication automatically
  • 15. Relational Database vs. Hadoop (1)
       - Schema: required on write (relational); required on read (Hadoop)
       - Speed: reads are fast (relational); writes are fast (Hadoop)
       - Governance: standards and structure (relational); loosely structured (Hadoop)
       - Processing: limited, no data processing (relational); processing coupled with data (Hadoop)
       - Data types: structured (relational); multi and unstructured (Hadoop)
       - Best fit use: interactive OLAP analytics, complex ACID transactions, operational data store (relational); historical/archive data, processing unstructured data, massive storage/processing (Hadoop)
       Relational Database vs. Hadoop (2)
  • 16. History
       - Originally built as infrastructure for the "Nutch" project.
       - Based on Google's MapReduce and Google File System.
       - Created by Doug Cutting in 2005 at Yahoo.
       - Named after his son's toy yellow elephant.
       Hadoop Timeline
  • 17. Apache Hadoop Ecosystem (1)
       Apache Hadoop Ecosystem (2)
       - HDFS: primary distributed file system
       - HBase: column-oriented database scaling to billions of rows
       - HCatalog: table and storage management layer for Hadoop, used by Hadoop applications (Pig, MapReduce, and Hive)
       - Hive: data warehouse with SQL-like access
       - Pig: high-level language for expressing data analysis programs
       - Mahout: machine learning library
       - Sqoop: imports data from relational databases
       - Flume: collection and import of log and event data
       - Oozie: Hadoop workflow scheduling
       - ZooKeeper: centralized service for maintaining configuration
       - Ambari: cluster management
  • 18. Hadoop 1.0 vs Hadoop 2.0
       Hadoop 1.0 vs Hadoop 2.0 (continued)
  • 19. Hadoop 2.0 Core Components
       - HDFS: a scalable and fault-tolerant distributed filesystem that stores data in any form.
       - Yet Another Resource Negotiator (YARN): the cluster management layer that handles the various workloads on the cluster.
       - MapReduce: a framework that allows parallel processing of data in Hadoop.
       Apache YARN: the resource manager for Hadoop 2.0
       Data processing engines run natively in Hadoop:
       - Batch: MapReduce
       - Interactive: Tez
       - Streaming: Storm
       - Graph: Giraph
       - Microsoft: REEF
       - SAS: LASR, HPA
       - Online: HBase
       - Others
       These run on YARN (cluster resource management), which runs on HDFS (redundant, reliable storage).
       - Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
       - Efficient: double the processing in Hadoop on the same hardware while providing predictable performance and quality of service
       - Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads
  • 20. Many Ways to Implement Hadoop (1)
       - Open source
       - License distribution (software license)
       - Cloud services
       Many Ways to Implement Hadoop (2)
       Hadoop hardware appliances:
       - Cisco offers Unified Computing System
       - Dell delivers a disk-intensive server (R720XD server)
       - HP "racks" up big data in a box (DL360p server)
       - IBM and Lenovo partner (x3650)
       - Oracle Big Data Appliance
       - And more
       Source: informationweek.com/big-data/hardware-architectures/10-hadoop-hardware-leaders
  • 21. Compare Hadoop Distributions (1)
       Compare Hadoop Distributions (2)
  • 22. Data Warehouse vs. Data Lake
       Source: www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
       Enterprise Hadoop with Data Lake (1)
       Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
  • 23. Enterprise Hadoop with Data Lake (2)
       Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
       How Does a Data Lake Work?
  • 24. Section: Hadoop Basics
       Hadoop 2.0 - Architecture
  • 25. About Hadoop Cluster
       - Both storage and processing, via master and slave nodes
       - 5 long-running daemons
         Storage:
         - NameNode (master)
         - Secondary NameNode (master)
         - DataNode (slaves)
         Processing:
         - Resource Manager (master)
         - Node Manager (slaves)
       [Diagram labels: NameNode, DataNode]
       Hadoop Software Requirements
       - Linux
         - Red Hat (CentOS), Debian (Ubuntu), SUSE
         - Native packages and tarballs available from Cloudera and Hortonworks
       - Windows
         - Windows Server
         - Native packages and tarballs available from Hortonworks
       - Both Cloudera and Hortonworks package installs also install all supporting software and set up accounts
         - JDK, the mapred and hdfs Linux users, etc.
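The five long-running daemons listed above can be spotted with the JDK's `jps` tool, which prints one "<pid> <ClassName>" line per Java process. The following is a minimal sketch, not part of the original deck: the function name `missing_daemons` is hypothetical, and it assumes the daemons appear under their usual class names (NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager).

```shell
#!/bin/sh
# Sketch: report which of the five Hadoop 2.0 daemons are absent from a
# jps-style listing ("<pid> <ClassName>" per line) read from stdin.
# In a multi-node cluster each host runs only some roles, so "missing"
# is relative to the roles you expect on that host (an assumption here).
missing_daemons() {
    listing=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # Compare the second field exactly, so "NameNode" does not
        # match a "SecondaryNameNode" line.
        if ! printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}
```

On a single-node setup, `jps | missing_daemons` would print nothing when all five daemons are up.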
  • 26. Hadoop Run Modes - Benefit
       Hadoop Setup: Recommendations (1)
       - Use Cloudera
         - Cloudera Manager
         - Package installs
       - Use Linux
       - NameNode: holds all metadata for HDFS
         - Needs to be a highly reliable machine
         - RAID drives, typically RAID 10
         - Dual power supplies
         - Dual network cards, bonded
         - The more memory the better, typically 36 GB to 64 GB
       - Secondary NameNode: provides checkpointing for the NameNode
         - The same hardware as the NameNode should be used
  • 27. Hadoop Setup: Recommendations (2)
       - DataNodes: hardware will depend on the specific needs of the cluster
         - No RAID needed; JBOD (just a bunch of disks) is used
         - Typical ratio: 1 hard drive : 2 cores : 4 GB of RAM
       About Cloudera
       Cloudera's Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is:
       - 100% Apache open source
       - Complete: contains all components needed for deployment
       - Fully documented and supported
       - Released on a reliable schedule
       Fastest path to success:
       - No need to write your own scripts or do integration testing on different components
       - Works with a wide range of operating systems, hardware, databases, and data warehouses
       Stable and reliable:
       - Extensive Cloudera QA systems, software, and processes
       - Tested and run in production at scale
       - Proven at scale in dozens of enterprise environments
       Community driven:
       - Incorporates only main-line components from the Apache Hadoop ecosystem, with no forks or proprietary underpinnings
       - Free
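The DataNode rule of thumb from the setup recommendations (1 hard drive : 2 cores : 4 GB of RAM) is simple arithmetic. A small sketch, with the hypothetical function name `datanode_sizing`, shows how a drive count maps to cores and memory; it is a sizing aid under that stated ratio, not a vendor formula.

```shell
#!/bin/sh
# Sketch: apply the 1 drive : 2 cores : 4 GB RAM ratio to one DataNode.
# The argument is the number of data drives in the node.
datanode_sizing() {
    drives=$1
    echo "drives=$drives cores=$((drives * 2)) ram_gb=$((drives * 4))"
}
```

For example, `datanode_sizing 12` suggests 24 cores and 48 GB of RAM for a node with 12 data drives.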
  • 28. Requirements for Setting Up Cloudera (1)
       - You must have root or password-less sudo access to the hosts.
       - If using root, the hosts must accept the same root password.
       - The hosts must have Internet access to allow the wizard to install software from archive.cloudera.com.
       - Cluster hosts must have a working network name resolution system and a correctly formatted /etc/hosts file.
       Requirements for Setting Up Cloudera (2)
       - You must have SSH access to the cluster hosts when you run the installation or upgrade wizard.
       - Security-Enhanced Linux (SELinux) must not block the installation.
       - IPv6 must be disabled.
       - iptables and firewalls must not block traffic; port 7180 must be open because it is used to access Cloudera Manager.
       - After installation, Cloudera Manager communicates using specific ports, which must be open.
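Two of the requirements above (SELinux not blocking, and a usable /etc/hosts file) are easy to check before running the wizard. This is a rough sketch only: the function names are hypothetical, nothing here is part of Cloudera's tooling, and the checks cover just these two items.

```shell
#!/bin/sh
# Sketch: preflight helpers for two of the requirements listed above.

# Succeeds when the given SELinux mode (as printed by `getenforce`) can
# block the installer. Only "Enforcing" blocks; "Permissive" and
# "Disabled" do not.
selinux_blocks() {
    [ "$1" = "Enforcing" ]
}

# Succeeds when an /etc/hosts-style file has a non-comment entry that
# names the given host (as canonical name or alias).
hosts_file_has() {
    file=$1
    host=$2
    awk -v h="$host" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit !found }
    ' "$file"
}
```

On a cluster host, a typical use would be `selinux_blocks "$(getenforce)" && echo "SELinux is enforcing"` and `hosts_file_has /etc/hosts "$(hostname -f)" || echo "FQDN missing from /etc/hosts"`.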
  • 29. Hadoop Port Numbers
       - More port information: http://www.cloudera.com/documentation/archive/manager/4-x/4-5-1/Configuring-Ports-for-Cloudera-Manager-Enterprise-Edition/cmeecp_topic_4.html
       Section: How to install and configure Apache Hadoop
  • 30. Setup Cloudera (1)
       1. Download the installer:
          sudo wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
       2. Give cloudera-manager-installer.bin executable permission:
          sudo chmod u+x cloudera-manager-installer.bin
       3. Run the Cloudera Manager Server installer:
          sudo ./cloudera-manager-installer.bin
       4. Follow the wizard until setup finishes.
       Setup Cloudera (2)
       5. Wait several minutes for the Cloudera Manager Server to complete its startup.
       6. In a web browser, enter http://<server host>:7180, where <server host> is the host on which the Cloudera Manager Server is running. The login screen for the Cloudera Manager Admin Console is displayed.
       7. Log into the Cloudera Manager Admin Console with the credentials Username: admin, Password: admin.
       8. Follow the setup instructions in the Cloudera Manager wizard until it finishes.
       Ref: www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_qs_quick_start.html
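Steps 5 and 6 above amount to waiting until the Admin Console answers on port 7180. One way to script that wait is sketched below; it assumes `curl` is installed, and the function names `cm_url` and `wait_for_cm` are hypothetical helpers, not Cloudera commands.

```shell
#!/bin/sh
# Sketch: build the Admin Console URL and poll it until the server answers.

cm_url() {
    echo "http://$1:7180"
}

# wait_for_cm HOST [TRIES]: returns 0 once the console responds,
# 1 if it never does within TRIES attempts (default 30, 10 s apart).
wait_for_cm() {
    host=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null "$(cm_url "$host")"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 10
    done
    return 1
}
```

After the installer exits, `wait_for_cm localhost && echo "Open $(cm_url localhost) in a browser"` would block until the console is reachable or the attempts run out.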
  • 31. Lab: Setup Cloudera Cluster on AWS
       In this lab, you will see how to:
       - Set up a Cloudera cluster on AWS
       First Test on Hadoop (1)
       - Use PuTTY to connect to the EC2 server and test the commands below.
       ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /
       Found 4 items
       drwxr-xr-x   - hbase hbase          0 2016-04-30 08:55 /hbase
       drwxrwxr-x   - solr  solr           0 2016-04-30 08:55 /solr
       drwxrwxrwt   - hdfs  supergroup     0 2016-04-30 08:59 /tmp
       drwxr-xr-x   - hdfs  supergroup     0 2016-04-30 09:00 /user
       ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -mkdir /user/ubuntu
       ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -chown ubuntu /user/ubuntu
       Notes:
       - The -ls command views the directory structure of HDFS.
       - The -mkdir command creates the directory "ubuntu" using the privileges of the "hdfs" user.
       - The -chown command modifies the directory "ubuntu", giving ownership to user:ubuntu.
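The mkdir and chown pair shown above is the usual way to give a new user an HDFS home directory. As a hedged sketch, the hypothetical helper `hdfs_home_cmds` below prints the two commands for any user name so they can be reviewed before being run. The `-p` flag (create parent directories, no error if the directory exists) is an addition not used in the slides.

```shell
#!/bin/sh
# Sketch: print the two commands that create /user/<name> as the hdfs
# superuser and hand ownership to <name>. Printing instead of executing
# keeps this safe to try on a machine without Hadoop.
hdfs_home_cmds() {
    user=$1
    echo "sudo -u hdfs hdfs dfs -mkdir -p /user/$user"
    echo "sudo -u hdfs hdfs dfs -chown $user /user/$user"
}
```

On a cluster host, `hdfs_home_cmds ubuntu | sh` would execute the printed commands.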
  • 32. First Test on Hadoop (2)
       - Use PuTTY to connect to the EC2 server and test the commands below.
       ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/
       Found 7 items
       drwxrwxrwx   - mapred hadoop         0 2016-04-30 08:56 /user/history
       drwxrwxr-t   - hive   hive           0 2016-04-30 08:59 /user/hive
       drwxrwxr-x   - hue    hue            0 2016-04-30 09:00 /user/hue
       drwxrwxr-x   - impala impala         0 2016-04-30 09:00 /user/impala
       drwxrwxr-x   - oozie  oozie          0 2016-04-30 09:01 /user/oozie
       drwxr-x--x   - spark  spark          0 2016-04-30 08:57 /user/spark
       drwxr-xr-x   - ubuntu supergroup     0 2016-04-30 09:05 /user/ubuntu
       ubuntu@ip-172-31-39-55:~$ hdfs dfs -mkdir input-data
       ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/ubuntu/
       Found 1 items
       drwxr-xr-x   - ubuntu supergroup     0 2016-04-30 09:16 /user/ubuntu/input-data
       Notes:
       - The first -ls views the contents of the "user" directory on HDFS; the directory created earlier is listed (see its owner).
       - The -mkdir command creates an additional directory.
       - The second -ls views the contents of the "ubuntu" directory on HDFS; the newly created directory is listed.
       Cloudera VM (1)
       - The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.
       - Requirements:
         - A 64-bit host OS
         - Virtualization software: VMware Player, KVM, or VirtualBox
         - The virtualization software requires a laptop that supports virtualization. If you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.
         - 8 GB of total RAM with 2 vCPUs
         - The total system memory required varies depending on the size of your data set and on the other processes that are running.
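On a Linux host, the Cloudera VM requirements above can also be checked from the command line rather than the BIOS. The sketch below is an assumption-laden aid, not part of the deck: it reads /proc-style files, the function names are hypothetical, and a CPU flag only shows that the processor supports virtualization, which may still be switched off in the BIOS.

```shell
#!/bin/sh
# Sketch: Linux-only checks for the VM host requirements.
# File paths are parameters so the functions can be tried on sample files.

# Succeeds when the CPU flags include Intel VT-x (vmx) or AMD-V (svm).
cpu_has_virt() {
    grep -Eq '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' "${1:-/proc/cpuinfo}"
}

# Prints total RAM in whole GB from the MemTotal line (reported in kB).
# The kernel reserves some memory, so an "8 GB" machine may print 7.
total_ram_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}
```

For example, `cpu_has_virt && echo "CPU supports virtualization"` and `total_ram_gb` give a quick first answer before downloading the VM image.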
  • 33. Cloudera VM (2)
       - Step 1: Download and run VMware
       - Step 2: Download the Cloudera VM
       - Step 3: Extract it to the Cloudera folder
       - Step 4: Open "cloudera-quickstart-vm-xx-vmware"
       Lab: Setup Cloudera VM
       In this lab, you will see how to:
       - Set up the Cloudera VM
       - Open the Cloudera VM
  • 34. About Cloudera Manager
       - The industry's first [...] for Apache Hadoop
       - [...] the Apache Hadoop stack
       - Automates the [...] of Apache Hadoop
       Discover, Diagnose, Optimize, Act
       Components: HDFS, MapReduce, HBase, ZooKeeper, Oozie, Hue
       Cloudera Manager Interface (1)
  • 35. Cloudera Manager Interface (2)
       Cloudera Manager Interface (3)
  • 36. Block Size & RF Config. Setting
       - Cloudera Manager -> HDFS Services -> Configuration
       About Hue (Hadoop User Experience)
       - Lightweight web server that lets you use Hadoop directly from your browser (open source web interface)
       - Makes the Hadoop platform (HDFS, MapReduce, Hive, etc.) easy to use
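To see what the block size and replication factor (RF) settings mentioned above mean in practice, the following sketch computes how many blocks a file occupies and how much raw disk it consumes. The function names are hypothetical, and the values in the usage note (128 MB blocks, RF 3) are the common Hadoop 2 defaults, assumed here because the slides do not state any values.

```shell
#!/bin/sh
# Sketch: block count and raw storage for a file stored in HDFS.

# hdfs_block_count FILE_MB BLOCK_MB: number of blocks the file occupies.
hdfs_block_count() {
    size_mb=$1
    block_mb=$2
    # Ceiling division: a partial final block still counts as a block.
    echo $(( (size_mb + block_mb - 1) / block_mb ))
}

# hdfs_raw_mb FILE_MB RF: raw disk used across the cluster.
hdfs_raw_mb() {
    size_mb=$1
    rf=$2
    # Every block is stored RF times on different DataNodes.
    echo $(( size_mb * rf ))
}
```

With the assumed defaults, `hdfs_block_count 1000 128` gives 8 blocks and `hdfs_raw_mb 1000 3` gives 3000 MB of raw storage for a 1000 MB file.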
  • 37. Example: Hue Web Interface
       Like Me if U Can
       - youtube.com/Born2Learn TH
       - facebook.com/Born2LearnTH
       - www.born2learn.net
       - surath@born2learn.net