First Step for Big Data
with Apache Hadoop
By AJ. Surath Kasembunsiri
NCS L.3, CompTIA Security+, ITIL Foundation v.3 , CCNA
MCT, MCITP Enterprise, MCSE +Security +Messaging
Instructor: AJ. Surath Kasembunsiri
Email: surath@born2learn.net
More than 15 years of IT experience, with 10+ years of enterprise IT consulting for companies such as EGAT, Trade Siam, and others.
Instructor for many organizations, including NSTDA Academy, Software Park, Microsoft, EGAT, ACIS, GSB, Kasikorn Bank, and more.
Agenda
Overview of Big Data
The Apache Hadoop ecosystem
Hadoop basics
How to install and configure Apache Hadoop
Overview of Big Data
What’s Big Data?
There is no single definition; here is Wikipedia’s:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, storage, search, sharing, transfer, analysis, and visualization.
Big Data Everywhere
Lots of data is being collected and warehoused:
Telecom (Call Detail Records)
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
Health care
Big Data Analytics
Business without analytics
source: http://workpointtv.com/news/4756
A New Set of Questions
How do I optimize my fleet based on weather and traffic patterns?
What’s the social sentiment for my brand or products?
How do I better predict future outcomes?
Next industrial revolution (1)
Next industrial revolution (2)
Big Data: 3V’s
Volume (Scale)
Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB
Data volume is increasing exponentially as collected and generated data grows
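The 44x figure follows directly from the two endpoints on the slide; as a quick arithmetic check (a throwaway sketch, nothing Hadoop-specific assumed):

```shell
# ratio of projected 2020 volume (35 ZB) to 2009 volume (0.8 ZB)
awk 'BEGIN { printf "%.0fx\n", 35 / 0.8 }'
```

35 / 0.8 = 43.75, which rounds to the 44x quoted on the slide.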
About Data Source
source: www-03.ibm.com/press/us/en/photo/39145.wss
Velocity (Speed)
Data is being generated fast and needs to be processed fast.
Some Make it 4V’s
Harnessing Big Data
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (data warehousing)
RTAP: Real-Time Analytics Processing (Big Data architecture & technology)
How to Implement Big Data
Define the business question
Define the data sources
Decide whether the technology requires Big Data or not
Database Technology (1)
Database Technology (2)
DATA STORAGE | DATA PROCESSING
Database Technology (3)
The Apache Hadoop Ecosystem
What is Hadoop?
Hadoop is a platform for data storage and processing that is:
Scalable
Fault tolerant
Open source (Apache license)
Written in Java

CORE HADOOP COMPONENTS
Hadoop Distributed File System (HDFS): file sharing & data protection across physical servers
MapReduce: distributed computing across physical servers

Flexibility: a single repository for storing, processing & analyzing any type of data; not bound by a single schema
Scalability: scale-out architecture divides workloads across multiple nodes; a flexible file system eliminates ETL bottlenecks
Low cost: can be deployed on commodity hardware; the open source platform guards against vendor lock-in
Hadoop is not for all types of work
Not good for processing transactions
Not good when work cannot be parallelized
Not good for low-latency data access
Not good for processing lots of small files
Not good for computation-intensive jobs with little data
Source: www.bigdatauniversity.com
The Need for Hadoop
Store and use all types of data
Process ALL the data, not just a sample
Scalability to many nodes
Commodity hardware
What Makes Hadoop Different?
Ability to scale out to petabytes in size using commodity hardware
Processing (MapReduce) jobs are sent to the data rather than shipping the data to be processed
Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured, and unstructured data
Manages fault tolerance and data replication automatically
Relational Database vs. Hadoop (1)
Relational                           | Hadoop
Schema required on write             | Schema required on read
Reads are fast                       | Writes are fast
Standards and structured governance  | Loosely structured
Limited or no data processing        | Processing coupled with data
Structured data types                | Multi- and unstructured data
Best-fit use:                        | Best-fit use:
Interactive OLAP analytics           | Historical/archive data
Complex ACID transactions            | Processing unstructured data
Operational data store               | Massive storage/processing
Relational Database vs. Hadoop (2)
History
Originally built as infrastructure for the “Nutch” project
Based on Google’s MapReduce and Google File System papers
Created by Doug Cutting in 2005 and developed at scale at Yahoo
Named after his son’s toy yellow elephant
Hadoop Timeline
Apache Hadoop Ecosystem (1)
Apache Hadoop Ecosystem (2)
HDFS: primary distributed file system
HBase: column-oriented database scaling to billions of rows
HCatalog: table and storage management layer that lets Hadoop applications (Pig, MapReduce, Hive) share data
Hive: data warehouse with SQL-like access
Pig: high-level language for expressing data analysis programs
Mahout: machine learning library
Sqoop: imports data from relational databases
Flume: collection and import of log and event data
Oozie: Hadoop workflow scheduler
ZooKeeper: centralized service for maintaining configuration
Ambari: cluster management
18. 35By AJ.Surath Kasembunsiri Born2Learn
Hadoop 1.0 vs Hadoop 2.0
Hadoop 2.0 Core Components
HDFS: a scalable and fault-tolerant distributed file system for storing data in any form
Yet Another Resource Negotiator (YARN): the cluster management layer that handles various workloads on the cluster
MapReduce: a framework that allows parallel processing of data in Hadoop
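The map → shuffle → reduce flow of MapReduce can be mimicked, purely as an illustration, by a local shell pipeline (no cluster involved): tr plays the mapper (emit one word per line), sort plays the shuffle (group identical keys together), and uniq -c plays the reducer (aggregate a count per key).

```shell
# map: one word per line; shuffle: sort identical words together;
# reduce: count occurrences of each word
echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c | sort -rn
```

On a real cluster the same word count runs as parallel map and reduce tasks over HDFS blocks; the pipeline only shows the data flow, not the distribution.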
Apache YARN
The resource manager for Hadoop 2.0.
Data processing engines run natively IN Hadoop:
BATCH: MapReduce | INTERACTIVE: Tez | STREAMING: Storm | GRAPH: Giraph | MICROSOFT: REEF | SAS: LASR, HPA | ONLINE: HBase | OTHERS
YARN: cluster resource management
HDFS: redundant, reliable storage
Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: doubles the processing in Hadoop on the same hardware while providing predictable performance & quality of service
Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Many Ways to Implement Hadoop (1)
Open source license
Distribution (software license)
Cloud services
Many Ways to Implement Hadoop (2)
Hadoop hardware appliances:
Cisco offers the Unified Computing System
Dell delivers a disk-intensive server (R720XD server)
HP "racks" up big data in a box (DL360p server)
IBM and Lenovo partner (x3650)
Oracle Big Data Appliance
And more...
Source: informationweek.com/big-data/hardware-architectures/10-hadoop-hardware-leaders
Data Warehouse vs. Data Lake
source: www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
ENTERPRISE HADOOP WITH DATA LAKE (1)
Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
ENTERPRISE HADOOP WITH DATA LAKE (2)
Source: http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
How Does a Data Lake Work?
About Hadoop Cluster
Both storage and processing
- Via Master and Slave nodes
5 long-running daemons
- Storage
• NameNode (Master)
• Secondary NameNode (Master)
• DataNode (Slaves)
- Processing
• Resource Manager (Master)
• Node Manager (Slaves)
Hadoop Software Requirements
Linux
• Red Hat (CentOS), Debian (Ubuntu), SUSE
• Native packages & tarballs available from Cloudera & Hortonworks
Windows
• Windows Server
• Native packages & tarballs available from Hortonworks
Both Cloudera & Hortonworks package installs also install all supporting software and set up accounts
• JDK; mapred and hdfs Linux users; etc.
Hadoop Run Modes - Benefit
Hadoop Setup – Recommendations (1)
Use Cloudera
• Cloudera Manager
• Package installs
Use Linux
NameNode – holds all metadata for HDFS
Needs to be a highly reliable machine
RAID drives – typically RAID 10
Dual power supplies
Dual network cards – bonded
The more memory the better – typically 36 GB to 64 GB
Secondary NameNode – provides checkpointing for the NameNode; the same hardware as the NameNode should be used
Hadoop Setup – Recommendations (2)
DataNodes – hardware will depend on the specific needs of the cluster
No RAID needed; JBOD (just a bunch of disks) is used
Typical ratio per node:
1 hard drive
2 cores
4 GB of RAM
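The 1 drive : 2 cores : 4 GB ratio scales linearly with disk count; as a quick sketch for a hypothetical 12-disk DataNode (the disk count is only an example, not a recommendation from the slides):

```shell
# apply the per-drive rule of thumb: 2 cores and 4 GB RAM per hard drive
disks=12
echo "disks: $disks, cores: $((disks * 2)), RAM: $((disks * 4)) GB"
```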
About Cloudera
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is:
100% Apache open source
Contains all components needed for deployment
Fully documented and supported
Released on a reliable schedule

Fastest path to success: no need to write your own scripts or do integration testing on different components; works with a wide range of operating systems, hardware, databases, and data warehouses
Stable and reliable: extensive Cloudera QA systems, software & processes; tested & run in production at scale; proven in dozens of enterprise environments
Community driven: incorporates only main-line components from the Apache Hadoop ecosystem, with no forks or proprietary underpinnings
FREE
Requirements for Setting Up Cloudera (1)
You must have root or password-less sudo access to the hosts
If using root, the hosts must accept the same root password
The hosts must have Internet access to allow the wizard to install software from archive.cloudera.com
Cluster hosts must have a working network name resolution system and a correctly formatted /etc/hosts file
Requirements for Setting Up Cloudera (2)
You must have SSH access to the cluster hosts when you run the installation or upgrade wizard
No blocking by Security-Enhanced Linux (SELinux)
IPv6 must be disabled
No blocking by iptables or firewalls; port 7180 must be open because it is used to access Cloudera Manager
After installation, Cloudera Manager communicates using specific ports, which must be open
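The checks above can be scripted before launching the installer. The snippet below is a minimal preflight sketch; the tool choices (getenforce, getent, ss) are assumptions about a typical Linux host, not part of the Cloudera installer itself.

```shell
#!/bin/sh
# Hypothetical preflight checks for a Cloudera Manager host (sketch).
preflight() {
  # SELinux should not be Enforcing during installation
  if command -v getenforce >/dev/null 2>&1; then
    [ "$(getenforce)" = "Enforcing" ] && echo "WARN: SELinux is Enforcing"
  fi
  # the host's own name must resolve (via /etc/hosts or DNS)
  getent hosts "$(hostname)" >/dev/null 2>&1 || echo "WARN: hostname does not resolve"
  # port 7180 must be free for the Cloudera Manager Admin Console
  if command -v ss >/dev/null 2>&1; then
    ss -ltn 2>/dev/null | grep -q ':7180 ' && echo "WARN: port 7180 already in use"
  fi
  echo "preflight checks done"
}
preflight
```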
Hadoop Port Numbers
More port information:
http://www.cloudera.com/documentation/archive/manager/4-x/4-5-1/Configuring-Ports-for-Cloudera-Manager-Enterprise-Edition/cmeecp_topic_4.html
How to install and configure Apache Hadoop
Setup Cloudera (1)
1. Download the installer:
sudo wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
2. Give cloudera-manager-installer.bin executable permission:
sudo chmod u+x cloudera-manager-installer.bin
3. Run the Cloudera Manager Server installer:
sudo ./cloudera-manager-installer.bin
4. Follow the wizard until setup finishes
Ref: www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_qs_quick_start.html
Setup Cloudera (2)
5. Wait several minutes for the Cloudera Manager Server to complete its startup.
6. In a web browser, enter http://<server-host>:7180, where <server-host> is the host on which the Cloudera Manager Server is running. The login screen for the Cloudera Manager Admin Console displays.
7. Log into the Cloudera Manager Admin Console with the credentials Username: admin, Password: admin.
8. Follow the Cloudera Manager wizard instructions until setup finishes.
Ref: www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_qs_quick_start.html
Lab: Setup Cloudera Cluster on AWS
In this lab, you will see how to:
Set up a Cloudera cluster on AWS
First Test on Hadoop (1)
Use PuTTY to connect to the EC2 server and test the commands below.
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x - hbase hbase 0 2016-04-30 08:55 /hbase
drwxrwxr-x - solr solr 0 2016-04-30 08:55 /solr
drwxrwxrwt - hdfs supergroup 0 2016-04-30 08:59 /tmp
drwxr-xr-x - hdfs supergroup 0 2016-04-30 09:00 /user
ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -mkdir /user/ubuntu
ubuntu@ip-172-31-39-55:~$ sudo -u hdfs hdfs dfs -chown ubuntu /user/ubuntu
hdfs dfs -ls /: view the HDFS directory structure
hdfs dfs -mkdir: create the "ubuntu" directory using the privileges of the "hdfs" user
hdfs dfs -chown: grant ownership of the "ubuntu" directory to user "ubuntu"
First Test on Hadoop (2)
Use PuTTY to connect to the EC2 server and test the commands below.
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/
Found 7 items
drwxrwxrwx - mapred hadoop 0 2016-04-30 08:56 /user/history
drwxrwxr-t - hive hive 0 2016-04-30 08:59 /user/hive
drwxrwxr-x - hue hue 0 2016-04-30 09:00 /user/hue
drwxrwxr-x - impala impala 0 2016-04-30 09:00 /user/impala
drwxrwxr-x - oozie oozie 0 2016-04-30 09:01 /user/oozie
drwxr-x--x - spark spark 0 2016-04-30 08:57 /user/spark
drwxr-xr-x - ubuntu supergroup 0 2016-04-30 09:05 /user/ubuntu
ubuntu@ip-172-31-39-55:~$ hdfs dfs -mkdir input-data
ubuntu@ip-172-31-39-55:~$ hdfs dfs -ls /user/ubuntu/
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-04-30 09:16 /user/ubuntu/input-data
List the contents of the "user" directory on HDFS; the directory created earlier appears, owned by "ubuntu"
Create an additional directory ("input-data")
List the contents of the "ubuntu" directory on HDFS; the newly created directory appears
Cloudera VM (1)
The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.
Requirements:
A 64-bit host OS
Virtualization software: VMware Player, KVM, or VirtualBox
The virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled
8 GB of total RAM with 2 vCPUs
The total system memory required varies depending on the size of your data set and on the other processes that are running
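On Linux you can also check for hardware virtualization support from a shell instead of rebooting into the BIOS; this sketch assumes a Linux host with /proc mounted (on other systems, use the vendor's own tools):

```shell
# count CPU flags advertising hardware virtualization:
# vmx = Intel VT-x, svm = AMD-V; 0 means no (visible) support
grep -Ec 'vmx|svm' /proc/cpuinfo 2>/dev/null || true
```

Note that a count of 0 can also mean the feature is present but disabled in the BIOS or hidden from a guest VM.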
Cloudera VM (2)
Step #1: Download & run VMware
Step #2: Download the Cloudera VM
Step #3: Extract it to the Cloudera folder
Step #4: Open "cloudera-quickstart-vm-xx-vmware"
Lab: Setup Cloudera VM
In this lab, you will see how to:
Set up the Cloudera VM
Open the Cloudera VM
About Cloudera Manager
The industry’s first end-to-end management application for Apache Hadoop: it manages the Apache Hadoop stack and automates the deployment and operation of Apache Hadoop
DISCOVER | DIAGNOSE | ACT | OPTIMIZE
HDFS | MAPREDUCE | HBASE | ZOOKEEPER | OOZIE | HUE
Cloudera Manager Interface (1)
Block Size & Replication Factor (RF) Configuration Setting
Cloudera Manager -> HDFS Services -> Configuration
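Outside Cloudera Manager, the same two settings live in hdfs-site.xml; the values below are only the common defaults (128 MB blocks, replication factor 3), shown as a sketch rather than values the deck prescribes:

```xml
<!-- hdfs-site.xml: HDFS block size and replication factor -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

In a Cloudera Manager-managed cluster these are normally edited through the Configuration page rather than by hand, since the manager regenerates config files on the hosts.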
About Hue (Hadoop User Experience)
A lightweight web server that lets you use Hadoop directly from your browser (an open source web interface)
Makes the Hadoop platform (HDFS, MapReduce, Hive, etc.) easy to use
Example: Hue Web Interface
Like Me if U Can
youtube.com/Born2Learn TH
facebook.com/Born2LearnTH
www.born2learn.net
surath@born2learn.net