© Huxi LI, 2018
HADOOP BIG DATA PLATFORM
Hortonworks HDP, Is it good enough ?
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good enough ?
3|Hadoop cluster provision with Ambari
4|Demos of typical use cases
Annexes
2
© Huxi LI, 2018
Digitalization
INDUSTRIAL TRENDS
• Digitalization of core business: Going digital is now a core strategy
for many organizations around the world, including the core business of
finance and insurance. Most successful enterprises deliver their core
business through digital channels.
• Digitalization of society: Social networks, collaborative platforms, e-
government, and smart devices are changing society as a whole.
- Social networks are everywhere and are starting to impact business models;
- Industrialized countries are pushing the digitalization of their entire public
services as a way of modernizing government and better serving their citizens.
• Internet of Things (IoT): The internet is expanding to physical objects;
experts estimate that the IoT will consist of about 30 billion objects by 2020
(IEEE, 2016).
• Artificial Intelligence (AI): AI has reached a new level of maturity,
moving from the academic, experimental stage into an era of massive practical
use. AI not only needs massive data but also produces massive data.
3
A matter of Data
© Huxi LI, 2018
AGE OF BIG DATA
4
Big Volume
High Variety
Uncertain Veracity
High Velocity
In 2016, global mobile data traffic
amounted to 7 exabytes per month. In
2021, mobile data traffic worldwide is
expected to reach 49 exabytes per month
at a compound annual growth rate of 47
percent (Statista, 2018)
Data comes in many varieties: structured,
semi-structured, and, increasingly, unstructured.
Data is produced at varying velocities, from
slow batch processing and near-real-time
streaming to real-time IoT sensors.
Veracity is one of the unfortunate
characteristics of today’s data. Not all data
are trustworthy.
© Huxi LI, 2018
HADOOP = DESIGNED FOR BIG DATA
Hadoop allows for the distributed processing of large data sets across clusters
of computers using simple programming models. It is designed to scale up to
thousands of machines, each offering local computation and storage.
• Massive Storage (Volume): Hadoop’s distributed file system (HDFS) is
specially designed to store massive data on commodity hardware, providing
high-throughput access to application data.
• Any Kind of Data (Variety): You can save text, images, video, XML, JSON, and
any other type of data in HDFS, as if you were working on a Unix-like file system;
this is crucial given the variety of today’s data.
• Massively Scalable Computation: Parallel computing is integrated into the
Hadoop core: YARN for cluster resource management and
MapReduce for parallel processing of large data sets. Many other derived
tools, such as Storm and Hive, build on this foundation (a minimal MapReduce
sketch follows).
5
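The deck itself shows no MapReduce code, so here is the classic word-count job in Java as a concrete illustration of the YARN/MapReduce bullet above. This is a minimal sketch, not part of the original deck; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token of every input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

YARN schedules the map and reduce tasks across the cluster; the same pattern scales from three nodes to thousands.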
© Huxi LI, 2018
HADOOP = A MATURE SYSTEM
• Production-proven: Hadoop has proven
itself in production at many
companies around the world.
• Available on Cloud: Hadoop is now
offered on all major cloud platforms,
such as Microsoft Azure, Amazon AWS,
IBM Cloud, and Google GCP.
6
© Huxi LI, 2018
HADOOP = A RICH ECOSYSTEM
The term Hadoop often refers not only to the core system but to the entire ecosystem of
software packages that can be installed on top of or alongside it.
Many choices:
• Ingestion: Sqoop, Flume, Kafka, NiFi, Storm, Flink, …
• Streaming & Compute: MapReduce, Storm, Spark, Flink, NiFi, …
• Analysis/SQL-on-Hadoop: Impala, Hive, Drill, …
• Machine learning: Mahout, Spark ML, …
• NoSQL: HBase, Cassandra, …
• Management: Ambari, HUE, …
7
© Huxi LI, 2018
VALUE PROPOSITIONS OF HADOOP
8
Designed for Big Data
• Massively scalable storage and computation
• Carefully designed for large-scale deployment
High Maturity
• Production-proven, highly robust
• Validated on big clusters with thousands of nodes
Rich Ecosystem
• An ecosystem has formed around Hadoop, offering many choices with enterprise-grade quality, freely available
• Wide support, both commercial and community-based
High Scalability
• Technically, a single Hadoop cluster can be scaled to thousands of nodes
• Functionally, Hadoop supports any kind of data and scales to a wide range of use cases
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good enough ?
3|Hadoop cluster provision with Ambari
4|Demos of typical use cases
Annexes
9
© Huxi LI, 2018
“Enterprises like Hortonworks’ storage and
compute processing, broad data ingestion,
data governance, and open source support
when deploying BDW.” - Forrester 2017
HDP = ENTERPRISE-GRADE HADOOP + 100% OPEN SOURCE
Enterprise-grade Hadoop
Hortonworks HDP is the only
enterprise-grade Hadoop
distribution that is fully open source
Apache License
All of the technology built into
the Hortonworks Data Platform
is an Apache open source
project
100% Open Source
Hortonworks HDP is the only
enterprise-grade Hadoop
distribution that is fully open source
No Vendor Lock-In
HDP delivers enterprise-grade
software that fosters
innovation and prevents vendor
lock-in
HDP
© Huxi LI, 2018
HDP = CAREFULLY ENGINEERED SECURITY ARCHITECTURE
Built-In Security
• Platform built-in security and data governance.
Corporate Integration
• Supports integration with the enterprise’s existing security systems via industry
standards such as Kerberos and LDAP.
Centralized Control
• Centralized cluster management, monitoring, and access control.
Simplicity
• Simple yet powerful, Unix-like resource permission control in HDFS (see the sketch below)
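To illustrate the Unix-like permission model, here is a minimal sketch using the Hadoop FileSystem Java API. It is not from the deck; the NameNode address and the directory name are assumptions based on the POC cluster described later, and changing ownership requires running as an HDFS superuser.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path dir = new Path("/user/huxili/secure_data"); // hypothetical directory
      fs.mkdirs(dir);
      // Equivalent of chmod 740: owner rwx, group r--, others ---
      fs.setPermission(dir, new FsPermission(FsAction.ALL, FsAction.READ, FsAction.NONE));
      // Equivalent of chown huxili:hadoop (needs HDFS superuser rights, e.g. the hdfs user)
      fs.setOwner(dir, "huxili", "hadoop");
    }
  }
}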
© Huxi LI, 2018
HDP = A LEADER IN BIG DATA WAREHOUSE
• Hortonworks was ranked as a leader in The Forrester Wave™:
Big Data Warehouse (Q2 2017).
12
“Hortonworks delivers actionable intelligence from all kinds of
data-in-motion and data-at-rest. Through its open source strategy,
Hortonworks continually evolves its offering by working closely
with partners across the EDW ecosystem of tools and vendors. The
vendor provides a cost-effective, nimble, and scalable architecture
to implement big data warehouses, whether on-premises or in the
cloud.” - Forrester 2017
© Huxi LI, 2018
HORTONWORKS HDP
13
© Huxi LI, 2018
VALUE PROPOSITIONS OF HORTONWORKS HDP
Cost
• No license fee
• Pay only for support
Openness
• Open source
• With the commercially friendly Apache license
Expertise
• Key contributors to Hadoop
• A decade of industrial Big Data experience
Security
• The backing of a market leader
• Mature technology; supported on premise, in the cloud, or on an appliance
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good for me ?
3|Hadoop Cluster Provision with Ambari
4|Demos of typical use cases
Annexes
15
© Huxi LI, 2018
AMBARI HADOOP CLUSTER PROVISION
1|Prepare VMs (manual, Ansible, Puppet, …)
2|Install Ambari server
3|Create Hadoop cluster with Ambari
4|Explore Hadoop cluster with Ambari
© Huxi LI, 2018
TYPICAL HADOOP CLUSTERS
• Hadoop backbone: The Hadoop backbone is typically
composed of HDFS, MapReduce, and YARN, i.e.
the core of the Hadoop cluster:
• HDFS for massive storage
• MapReduce for parallel computation over datasets
• YARN for job scheduling and resource management
• Edge nodes: Edge nodes are part of the Hadoop
cluster, but they are typically connected to the rest
of the corporate network, providing operational
environments to end users.
• Ambari is a typical edge-node service providing cluster management
• Sqoop & Flume are another kind of edge node, moving data into or out of the cluster
• Spark & Hive are edge nodes serving applications or business users
17
Corporate Network
Administration Applications
Edge nodes
(e.g. Hive)
Management
(e.g. Ambari)
© Huxi LI, 2018
A PROOF-OF-CONCEPT HADOOP CLUSTER
• POC Hadoop cluster:
• Composed of 3 nodes
• VMs shared between edge and backbone roles
• Hortonworks HDP 2.6.1.3
• AWS VPC 172.31.0.0/16
• Provisioner:
• Ambari as provisioner of the cluster
18
VPC: 172.31.0.0/16
hdp-master (172.31.45.112)
hdp-slave1
(172.31.37.152)
hdp-slave2
(172.31.45.181)
Namenode,
Ambari, etc.
Datanode,
Hive, Storm,
etc.
Datanode,
Nodemanager,
etc.
© Huxi LI, 2018
MACHINE SETUP
• Provision VMs with AWS:
• 3 EC2 instances
• Red Hat 7.0 Linux
• 1 VPC allowing traffic among nodes
19
© Huxi LI, 2018
NETWORK TRAFFIC SETUP
• Authorized traffic (AWS):
20
Cluster internal traffic
© Huxi LI, 2018
INSTALL AMBARI SERVER
• Follow the official installation guide for details
21
© Huxi LI, 2018
SETUP HADOOP CLUSTER WITH AMBARI
22
Click Here
Ambari Web UI: http://hdp-master.huxili.com:8080
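Everything done through this web UI can also be driven through Ambari’s REST API on the same port. Below is a minimal sketch, not from the deck, assuming Java 11+ and the default admin/admin credentials shown later in the annex.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariClusterCheck {
  public static void main(String[] args) throws Exception {
    // Basic auth with the default Ambari credentials (admin/admin) used in this POC
    String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("http://hdp-master.huxili.com:8080/api/v1/clusters"))
        .header("Authorization", "Basic " + auth)
        .GET()
        .build();
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println(resp.body()); // JSON description of the clusters this Ambari server manages
  }
}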
© Huxi LI, 2018
SETUP HADOOP CLUSTER WITH AMBARI
23
Choose Hadoop nodes
© Huxi LI, 2018
EXPLORE HADOOP CLUSTER WITH AMBARI
24
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good for me ?
3|Hadoop Cluster Provision with Ambari
4|Demos of typical use cases
Annexes
25
© Huxi LI, 2018
26
1. Data storage in HDFS …
© Huxi LI, 2018
DEMO 1 - SAVING DATA IN HDFS
• Open the Ambari UI at http://hdp-master.huxili.com
• Open File View from the Ambari UI
• Create a directory in HDFS: /user/huxili/dfs_demo
• Upload files into the newly created directory
• Manipulate the uploaded files: rename, set permissions (Ctrl+Mouse => select a directory); a programmatic alternative follows
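The same demo can be scripted instead of clicking through the File View. A minimal sketch with the Hadoop FileSystem API, assuming the NameNode address of this POC cluster; the local file name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsDemoUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path target = new Path("/user/huxili/dfs_demo");
      fs.mkdirs(target); // create the demo directory
      // Copy a local file into HDFS (sample.txt is a placeholder name)
      fs.copyFromLocalFile(new Path("sample.txt"), target);
      // Rename it inside HDFS, as done via the File View in the demo
      fs.rename(new Path(target, "sample.txt"), new Path(target, "sample_renamed.txt"));
    }
  }
}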
© Huxi LI, 2018
28
2. JSON Data query with Hive …
© Huxi LI, 2018
DEMO 2 – DATA QUERY WITH HIVE
• Check the sample Twitter data (JSON) in /user/huxili/twitter_data
• Open Ambari Hive View 2.0
• Create a database for the demo -> create database demo;
• Create an external table demo_tweets pointing to /user/huxili/twitter_data
• Query the JSON data in Hive; a JDBC sketch follows
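The same query can also be run programmatically over JDBC against HiveServer2. A minimal sketch, assuming HiveServer2 runs on hdp-slave1 (where Hive is installed in this POC) on its default port 10000, and that demo_tweets exposes a tweetmessage column as in the annex; the hive-jdbc driver jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host and port are assumptions for this POC
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://hdp-slave1.huxili.com:10000/demo";
    try (Connection conn = DriverManager.getConnection(url, "huxili", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT tweetmessage FROM demo_tweets LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // print each tweet message
      }
    }
  }
}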
© Huxi LI, 2018
30
3. Data Streaming with Storm …
© Huxi LI, 2018
DEMO 3 – STREAMING OF TWEETS WITH STORM
31
StreamingSpout → Tweets Splitter Bolt → Filter Bolt → Counter Bolt → Aggregation & HDFS Writing Bolt → Hadoop HDFS
(File: /user/huxili/storm_out/counts.txt)
© Huxi LI, 2018
DEMO 3 – DATA STREAMING WITH STORM
• Open Ambari File View -> upload the example streaming jar
• SSH into hdp-slave1 (the Storm host)
• su root, then download the streaming example (hdfs dfs -get /user/huxili/storm_apps/tw*.jar)
• Submit the streaming application to Storm
• The streaming stops after 30 seconds; explore the results in Ambari
© Huxi LI, 2018
33
Annexes
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
• Provision VMs on AWS
• Install required packages
• Install Ambari Server
A2|Hadoop Cluster Setup
A3|Streaming Architecture Hive/Storm
34
© Huxi LI, 2018
• Provision VMs:
• Using AWS EC2 instances
• Red Hat 7.0 Linux
• AWS VPC, 3 nodes
A1 - AMBARI SERVER SETUP
35
MACHINES SETUP
© Huxi LI, 2018
• Updating /etc/hosts (All nodes):
• 172.31.45.112 hdp-master.huxili.com
• 172.31.37.152 hdp-slave1.huxili.com
• 172.31.45.181 hdp-slave2.huxili.com
• Updating /etc/hostname (All nodes):
• hdp-master.huxili.com (Master)
• hdp-slave1.huxili.com (slave1)
• hdp-slave2.huxili.com (slave2)
A1 - AMBARI SERVER SETUP
36
MACHINES SETUP
© Huxi LI, 2018
• Authorized traffic (AWS):
A1 - AMBARI SERVER SETUP
37
NETWORK TRAFFIC SETUP
Cluster internal traffic
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
• Provision VMs on AWS
• Install required packages
• Install Ambari Server
A2|Hadoop Cluster Setup
A3|Streaming Architecture Hive/Storm
38
© Huxi LI, 2018
• hdp-master.huxili.com:
• sudo yum install -y ntp
• sudo systemctl enable ntpd
• sudo yum install -y postgresql-jdbc
• sudo yum install -y wget
• sudo wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.1.3/ambari.repo -O
/etc/yum.repos.d/ambari.repo
• sudo yum install -y ambari-server
A1 - AMBARI SERVER SETUP
39
INSTALL REQUIRED PACKAGES
© Huxi LI, 2018
• hdp-slave1.huxili.com & hdp-slave2.huxili.com:
• sudo yum install -y ntp
• sudo systemctl enable ntpd
A1 - AMBARI SERVER SETUP
40
INSTALL REQUIRED PACKAGES
© Huxi LI, 2018
A1 - AMBARI SERVER SETUP
41
INSTALL REQUIRED PACKAGES
• hdp-slave1.huxili.com: Install hive metastore DB (postgres)
© Huxi LI, 2018
A1 - AMBARI SERVER SETUP
42
INSTALL REQUIRED PACKAGES
• hdp-slave1.huxili.com: Check Hive DB
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
• Provision VMs on AWS
• Install required packages
• Install Ambari Server
A2|Setup Hadoop Cluster
A3|Streaming Architecture Hive/Storm
43
© Huxi LI, 2018
• Commands to run:
A1 - AMBARI SERVER SETUP
44
INSTALL SERVER
© Huxi LI, 2018
• Configure JDBC driver:
• sudo ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar
A1 - AMBARI SERVER SETUP
45
INSTALL SERVER
© Huxi LI, 2018
• Setup server:
• sudo ambari-server setup
A1 - AMBARI SERVER SETUP
46
INSTALL SERVER
© Huxi LI, 2018
• Setup server (continued):
A1 - AMBARI SERVER SETUP
47
INSTALL SERVER
© Huxi LI, 2018
• Start server:
A1 - AMBARI SERVER SETUP
48
INSTALL SERVER
© Huxi LI, 2018
• Check services binding:
A1 - AMBARI SERVER SETUP
49
INSTALL SERVER
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
A2|Hadoop Cluster Setup
• Install Hadoop components
• Check cluster sanity
A3|Streaming Architecture Hive/Storm
50
© Huxi LI, 2018
• Open Ambari management UI:
• http://hdp-master.huxili.com:8080
• User: admin
• Pwd: admin
A2 - HADOOP CLUSTER SETUP
51
STEP 1 – OPEN MANAGEMENT UI
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
52
STEP 2 – START CREATION WIZARD
Click Here
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
53
STEP 3 – CHOOSE CLUSTER NAME
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
54
STEP 4 – CHOOSE VERSION
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
55
STEP 5 – CHOOSE PARAMETERS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
56
STEP 6 – AMBARI HOST CHECK (FAILED)
Error
© Huxi LI, 2018
• vi /etc/hostname and set the right hostname
• hdp-master.huxili.com
• hdp-slave1.huxili.com
• hdp-slave2.huxili.com
A2 - HADOOP CLUSTER SETUP
57
STEP 6 – UPDATE HOSTNAMES AND RETRY
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
58
STEP 7 – CHOOSE REQUIRED SERVICES
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
59
STEP 8 – DISTRIBUTION OF COMPONENTS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
60
STEP 8 – DISTRIBUTION OF COMPONENTS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
61
STEP 9 – CONFIGURE HIVE DB (POSTGRES)
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
62
STEP 10 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
63
STEP 11 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
64
STEP 12 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
65
STEP 13 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
66
STEP 14 – INSTALLATION FAILED
© Huxi LI, 2018
• Error messages:
• Solution: manually install the missing package
• sudo yum-config-manager --enable rhui-REGION-rhel-server-optional
• sudo yum install libtirpc-devel
A2 - HADOOP CLUSTER SETUP
67
STEP 15 – ANALYZE ERROR MESSAGE
Execution of '/usr/bin/yum -d 0 -e 0 -y install hadoop_2_6_4_0_91-client' returned 1. Error:
Package: hadoop_2_6_4_0_91-hdfs-2.7.3.2.6.4.0-91.x86_64 (HDP-2.6-repo-1)
Requires: libtirpc-devel
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
68
STEP 16 – INSTALL MISSING PACKAGES
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
69
STEP 17 – INSTALL SUCCESS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
70
STEP 18 – AMBARI DASHBOARD
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
71
STEP 19 – CREATE USER ‘HUXILI’
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
A2|Hadoop Cluster Setup
• Install Hadoop components
• Check cluster sanity
A3|Streaming Architecture Hive/Storm
72
© Huxi LI, 2018
• Open HDFS File View failed:
A2 - HADOOP CLUSTER SETUP
73
CHECK FILE VIEW
© Huxi LI, 2018
• Check view definitions (cluster name is empty):
A2 - HADOOP CLUSTER SETUP
74
CHECK FILE VIEW
© Huxi LI, 2018
• HDFS File View after filling cluster name:
A2 - HADOOP CLUSTER SETUP
75
CHECK FILE VIEW
© Huxi LI, 2018
• Open Hive View failed:
• Missing ‘/user/huxili’
A2 - HADOOP CLUSTER SETUP
76
CHECK HIVE VIEW
© Huxi LI, 2018
• Create directory for custom Hive user:
• hdfs dfs -ls / && hdfs dfs -ls /user
A2 - HADOOP CLUSTER SETUP
77
CHECK HIVE VIEW
© Huxi LI, 2018
• Check HDFS:
• hdfs dfs -getfacl /user
• Create /user/huxili & /user/admin
• su hdfs
• hdfs dfs -mkdir /user/huxili && hdfs dfs -chown huxili /user/huxili
• hdfs dfs -mkdir /user/admin && hdfs dfs -chown admin /user/admin
A2 - HADOOP CLUSTER SETUP
78
CHECK HIVE VIEW
© Huxi LI, 2018
• Open Hive View OK after creation of missing directory:
A2 - HADOOP CLUSTER SETUP
79
CHECK HIVE VIEW
© Huxi LI, 2018
• Open Hive View 2.0 (OK) :
A2 - HADOOP CLUSTER SETUP
80
CHECK HIVE VIEW 2.0
© Huxi LI, 2018
• Check service binding:
• netstat -tulpn
A2 - HADOOP CLUSTER SETUP
81
CHECK SERVICE BINDING
© Huxi LI, 2018
• Service binding problems:
• ZooKeeper listens only on IPv6, and cluster join fails:
• Log error messages:
• (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server hdp-slave1.huxili.coms/... Will not
attempt to authenticate using SASL (unknown error)
• Solution:
• Force ZooKeeper to use IPv4 by adding the following line in “Advanced zookeeper-env => template”:
‐ export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djava.net.preferIPv4Stack=true"
A2 - HADOOP CLUSTER SETUP
82
CHECK SERVICE BINDING
© Huxi LI, 2018
• Force IPv4 for the following services:
• ambari-infra-solr:
• SOLR_OPTS="$SOLR_OPTS -Djava.net.preferIPv4Stack=true"
• SmartSense:
• export ANALYZER_JAVA_OPTS="{{analyzer_jvm_opts}} -Djava.net.preferIPv4Stack=true -Xmx{{analyzer_jvm_heap}}m"
• export ZEPPELIN_JAVA_OPTS="-Dhdp.version={{hdp_version}} -Dlog.file.name=activity-explorer.log -
DSmartSenseActivityExplorer -Djava.net.preferIPv4Stack=true"
• YARN:
• YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"
• STORM:
• DRPC: -Xmx768m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true
• Nimbus: -Xmx1024m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true
• Supervisor & UI & Storm-site/logviewer: -Djava.net.preferIPv4Stack=true
A2 - HADOOP CLUSTER SETUP
83
CHECK SERVICE BINDING
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
84
SERVICE INSTALL SUCCESSFUL !
© Huxi LI, 2018
ANNEXES
A1|Ambari Server
A2|Hadoop Cluster Setup
A3|Streaming Architecture Hive/Storm
• Store twitter data in HDFS
• Hive Query of Twitter data
• Twitter streaming with Storm
85
© Huxi LI, 2018
• Create the account huxili using Ambari
• Create directories in HDFS:
• Log in as hdfs (su hdfs)
• Create /user/huxili
• Change its owner to huxili
• Now log in to Ambari as user ‘huxili’ and open ‘Files View’ from the menu. Since huxili owns
the newly created directory, we can add data to it. I create the following directory to
save the Twitter data:
• /user/huxili/twitter_data
A3 - STREAMING ARCHITECTURE HIVE/STORM
86
STORE TWITTER DATA
© Huxi LI, 2018
• Open Hive view 2.0 and execute:
• CREATE DATABASE `huxili`
• Create Table ‘tweets’ :
SET hive.support.sql11.reserved.keywords=false; DROP TABLE IF EXISTS tweets;
CREATE EXTERNAL TABLE tweets (createddate string, geolocation string, tweetmessage string, `user`
struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE
LOCATION '/user/huxili/twitter_data/'
A3 - STREAMING ARCHITECTURE HIVE/STORM
87
HIVE QUERY OF TWITTER DATA – CREATE TABLE
© Huxi LI, 2018
• Create Table ‘tweets’ OK:
A3 - STREAMING ARCHITECTURE HIVE/STORM
88
HIVE QUERY OF TWITTER DATA – CREATE TABLE CONTINUED
© Huxi LI, 2018
• SET hive.support.sql11.reserved.keywords=false;
SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name =
'Aimee_Cottle';
A3 - STREAMING ARCHITECTURE HIVE/STORM
89
HIVE QUERY OF TWITTER DATA – QUERY DATA
Failed with error
messages
© Huxi LI, 2018
• Using another parser:
• curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-serde-1.3.8-jar-with-dependencies.jar -O
• curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-udf-1.3.8-jar-with-dependencies.jar -O
• cp *.jar /usr/hdp/2.6.4.0-91/hive/lib/ && cp *.jar /usr/hdp/2.6.4.0-91/hive2/lib/
• Restart Hive (Ambari)
• In Hive View2, execute:
• ADD JAR /usr/hdp/2.6.4.0-91/hive2/lib/json-serde-1.3.8-jar-with-dependencies.jar;
A3 - STREAMING ARCHITECTURE HIVE/STORM
90
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
© Huxi LI, 2018
• Create table ‘tweets2’ using the OpenX SerDe:
A3 - STREAMING ARCHITECTURE HIVE/STORM
91
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
CREATE EXTERNAL TABLE tweets2 (createddate string, geolocation string, tweetmessage string,
`user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/huxili/twitter_data/'
© Huxi LI, 2018
• Query twitter data on the new table:
• SELECT * FROM tweets2 LIMIT 100;
• It works !
A3 - STREAMING ARCHITECTURE HIVE/STORM
92
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
© Huxi LI, 2018
• Query twitter data on the new table:
• SET hive.support.sql11.reserved.keywords=false; SELECT tweetmessage, user.name, createddate FROM
tweets2 WHERE user.screenname = 'Aimee_Cottle';
A3 - STREAMING ARCHITECTURE HIVE/STORM
93
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
OK
© Huxi LI, 2018
• Objective:
• In this example, I will use Storm and twitter4j to analyze real-time tweets and then save the
analysis results to HDFS on the Hadoop cluster.
• Components:
• TwitterSampleSpout – a Storm spout responsible for reading the Twitter stream using twitter4j
(http://twitter4j.org).
• WordSplitterBolt – a Storm bolt responsible for splitting tweets into words for analysis.
• IgnoreWordsBolt – a Storm bolt responsible for filtering out unwanted words.
• WordCounterBolt – a Storm bolt responsible for word-frequency analysis.
• HdfsWriterBolt – a Storm bolt responsible for writing the analysis results to HDFS.
A3 - STREAMING ARCHITECTURE HIVE/STORM
94
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Setup of Twitter streaming (OAuth):
A3 - STREAMING ARCHITECTURE HIVE/STORM
95
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• STORM Topology:
A3 - STREAMING ARCHITECTURE HIVE/STORM
96
STREAMING WITH STORM – TWITTER EXAMPLE
StreamingSpout → Tweets Splitter Bolt → Filter Bolt → Counter Bolt → Aggregation & HDFS Writing Bolt → Hadoop HDFS
© Huxi LI, 2018
• Java components:
• AppContext:
• Application configuration.
• HdfsWriterBolt:
• Aggregation & output to HDFS.
• TwitterSampleSpout:
• Streaming of tweets using twitter4j.
• WordSplitterBolt, IgnoreWordsBolt, WordCounterBolt:
• Splitting, filtering, and counting bolts.
• Resources:
• Configuration files for twitter4j and OAuth authentication.
A3 - STREAMING ARCHITECTURE HIVE/STORM
97
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• STORM Topology (Java):
A3 - STREAMING ARCHITECTURE HIVE/STORM
98
STREAMING WITH STORM – TWITTER EXAMPLE
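The original slide shows the topology class as a screenshot. The sketch below reconstructs the likely wiring from the components listed above, assuming Storm 1.x (org.apache.storm) as shipped with HDP 2.6; the component IDs, parallelism hints, and the "word" field name are illustrative, and the spout/bolt classes are assumed to live in the same package.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class Topology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // Spout reading the live Twitter stream via twitter4j
    builder.setSpout("tweets", new TwitterSampleSpout(), 1);
    // Split each tweet into individual words
    builder.setBolt("splitter", new WordSplitterBolt(), 2).shuffleGrouping("tweets");
    // Drop stop words and other noise
    builder.setBolt("filter", new IgnoreWordsBolt(), 2).shuffleGrouping("splitter");
    // Count word frequencies; fieldsGrouping routes each word to one bolt instance
    builder.setBolt("counter", new WordCounterBolt(), 1).fieldsGrouping("filter", new Fields("word"));
    // Aggregate and write the counts to HDFS
    builder.setBolt("hdfs-writer", new HdfsWriterBolt(), 1).shuffleGrouping("counter");

    StormSubmitter.submitTopology("twitter-stream-example", new Config(), builder.createTopology());
  }
}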
© Huxi LI, 2018
• Writing to HDFS (Java):
A3 - STREAMING ARCHITECTURE HIVE/STORM
99
STREAMING WITH STORM – TWITTER EXAMPLE
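This slide, too, is a code screenshot in the original. Below is a hedged sketch of what such an HDFS-writing bolt can look like using the plain Hadoop FileSystem API (rather than the storm-hdfs connector); the NameNode address and the tuple field names "word" and "count" are assumptions, while the output path comes from the topology diagram.

import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class HdfsWriterBolt extends BaseRichBolt {
  // HDFS handles are not serializable, so they are created in prepare(), not in the constructor
  private transient FileSystem fs;
  private transient FSDataOutputStream out;
  private transient OutputCollector collector;

  @Override
  public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    try {
      Configuration hadoopConf = new Configuration();
      hadoopConf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NameNode address
      fs = FileSystem.get(hadoopConf);
      // Output file from the topology diagram; overwritten on each run
      out = fs.create(new Path("/user/huxili/storm_out/counts.txt"), true);
    } catch (Exception e) {
      throw new RuntimeException("Cannot open HDFS output", e);
    }
  }

  @Override
  public void execute(Tuple tuple) {
    try {
      // Field names "word" and "count" are assumptions about WordCounterBolt's output
      String line = tuple.getStringByField("word") + "\t" + tuple.getLongByField("count") + "\n";
      out.write(line.getBytes(StandardCharsets.UTF_8));
      out.hflush(); // make the bytes visible to HDFS readers without closing the stream
      collector.ack(tuple);
    } catch (Exception e) {
      collector.fail(tuple);
    }
  }

  @Override
  public void cleanup() {
    try { out.close(); fs.close(); } catch (Exception ignored) { }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Terminal bolt: nothing is emitted downstream
  }
}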
© Huxi LI, 2018
• Packaging and execution:
• Built with Maven
• Using the shade plugin to produce a self-contained (fat) jar
A3 - STREAMING ARCHITECTURE HIVE/STORM
100
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Procedure of execution:
1. Create the shaded jar (mvn clean install -Pcluster)
2. Upload the generated application to HDFS using Ambari File View
3. Log in to slave1 (Storm is installed on slave1)
4. Perform the following operations:
‐ sudo su
‐ hdfs dfs -get /user/huxili/storm_apps/*.jar
‐ storm jar twitter-stream-example.jar com.huxili.storm.twitter.example.Topology
5. Observe the execution (see the logs)
6. When the execution terminates, check the output file saved in HDFS
(/user/huxili/storm_out/counts.txt)
A3 - STREAMING ARCHITECTURE HIVE/STORM
101
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Example execution logs:
A3 - STREAMING ARCHITECTURE HIVE/STORM
102
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Example execution logs:
A3 - STREAMING ARCHITECTURE HIVE/STORM
103
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Example results:
A3 - STREAMING ARCHITECTURE HIVE/STORM
104
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
105
Progress. Together.

More Related Content

What's hot

Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks
 
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and HortonworksPowering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Hortonworks
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
Hortonworks
 
Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AI
DataWorks Summit
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended Benefits
DataWorks Summit
 
Containers and Big Data
Containers and Big DataContainers and Big Data
Containers and Big Data
DataWorks Summit
 
Containers and Big Data
Containers and Big Data Containers and Big Data
Containers and Big Data
DataWorks Summit
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash Course
DataWorks Summit
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
BigDataExpo
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
Hortonworks
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
DataWorks Summit
 
10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe
DataWorks Summit
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
DataWorks Summit
 
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
DataWorks Summit
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
Hortonworks
 
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Hortonworks
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Hortonworks
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks
 

What's hot (20)

Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and HortonworksPowering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
 
Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AI
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended Benefits
 
Containers and Big Data
Containers and Big DataContainers and Big Data
Containers and Big Data
 
Containers and Big Data
Containers and Big Data Containers and Big Data
Containers and Big Data
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash Course
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
 
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 

Similar to Hortonworks HDP, Is it goog enough ?

Transform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksTransform You Business with Big Data and Hortonworks
Transform You Business with Big Data and Hortonworks
Hortonworks
 
Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks
Pactera_US
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User Group
Mats Johansson
 
Meetup oslo hortonworks HDP
Meetup oslo hortonworks HDPMeetup oslo hortonworks HDP
Meetup oslo hortonworks HDP
Alexander Bakos Leirvåg
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
Humoyun Ahmedov
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Hortonworks
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
Eric Baldeschwieler
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
Muhammad Rifqi
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
Hortonworks
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
Alluxio, Inc.
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Hortonworks
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
John Sing
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Vigen Sahakyan
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
Hortonworks
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
Skillspeed
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
Pratimakumari213460
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
POSSCON
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
zafarali1981
 

Similar to Hortonworks HDP, Is it goog enough ? (20)

Transform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksTransform You Business with Big Data and Hortonworks
Transform You Business with Big Data and Hortonworks
 
Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User Group
 
Meetup oslo hortonworks HDP
Meetup oslo hortonworks HDPMeetup oslo hortonworks HDP
Meetup oslo hortonworks HDP
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
 

Recently uploaded

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 

Recently uploaded (20)

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Hortonworks HDP, Is it goog enough ?

  • 1. © Huxi LI, 2018 HADOOP BIG DATA PLATFORM Hortonworks HDP, Is it good enough ?
  • 2. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good enough ? 3|Hadoop cluster provision with Ambari 4|Demos of typical use cases Annexes 2
  • 3. © Huxi LI, 2018 Digitalization INDUSTRIAL TRENDS • Digitalization of core business: Going digital is now a core strategy for many organizations around the world, including financial et insurance core business. Most successful enterprises deliver their core business through digital channels. • Digitalization of society: Social networks, collaborative platforms, e- government, and smart devices are changing the entire society. - Social network is everywhere and start impacting the business model; - Industrialized countries are pushing the digitalization of their entire public services as a way of modernizing governments and better serving their citizens. • Internet Of Thing (IoT): The internet are expanding to physical objects; experts estimate that the IoT will consist of about 30 billion objects by 2020 (IEEE, 2016). • Artificial Intelligence (AI): The maturity of AI reaches another level, from academic experimental stage into massive practical usage era. AI not only need massive data and but also produce massive data. 3 A matter of Data
  • 4. © Huxi LI, 2018 AGE OF BIG DATA 4 Big Volume High Variety Uncertain Veracity High Velocity In 2016, global mobile data traffic amounted to 7 exabytes per month. In 2021, mobile data traffic worldwide is expected to reach 49 exabytes per month at a compound annual growth rate of 47 percent (Statista, 2018) Variety of data can be numerous - structured, semi-structured and mostly unstructured data as well. Velocity of data producing can be variable from slow batch processing, near real-time streaming, to real-time IoT sensors. Veracity is one of the unfortunate characteristics of today’s data. Not all data are trustworthy.
  • 5. © Huxi LI, 2018 HADOOP = DESIGNED FOR BIG DATA Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up to thousands of machines, each offering local computation and storage. • Massive Storage (Volume): Hadoop’s distributed file system (HDFS), is specially designed to store massive data on commodity hardware, providing high-throughput access to application data. • Any kind of Data (Variety): You can save text, image, video, XML, JSON and any other type of data in HDFS, as if you are working on a unix-like file system - this is very important because of the variety of today’s data. • Massively Scalable Computation: Parallel computing is integrated into Hadoop core system – YARN for cluster resource management and MapReduce for parallel processing of large data sets. And many other derived tools such as Storm, Hive, and so on you can choose from. 5
  • 6. © Huxi LI, 2018 HADOOP = A MATURE SYSTEM • Production-proved: Hadoop has been confirmed in production by many companies around the world. • Available on Cloud: Hadoop are now deployed on all major cloud platforms such Microsoft AZURE, Amazon AWS, IBM Cloud, and Google GCP. 6
  • 7. © Huxi LI, 2018 HADOOP = A RICH ECOSYSTEM The term Hadoop refers to the entire ecosystem or collection of software packages that can be installed on top of or alongside Hadoop. Many choices : • Ingestion : Scoop, Flume, Kafka, NIFI, Storm, Flink, … • Streaming & Compute: MapReduce, Storm, Spark, Flink, Nifi, … • Analysis/SQL-on-Hadoop: Impala, Hive, Drill, … • Machine learning: Mahout, Spark ML, … • NoSQL: HBase, Cassandra, … • Management: Ambari, HUE, … 7
  • 8. © Huxi LI, 2018 VALUE PROPOSITIONS OF HADOOP 8 Designed for Big Data • Massively scalable storage and computation • Carefully designed for large scale deployment High Maturity • Production-proved, highly robust • Confirmed by big clusters with thousands of nodes Rich Ecosystem • Ecosystem formed around Hadoop with many choices, with enterprise-grade quality but freely available • Wide supports, commercially or of community. High Scalability • Technically, a single Hadoop cluster can be scaled to thousands of nodes • Functionally, Hadoop support any kind of data, scalable to a wide range of use cases.
  • 9. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good enough ? 3|Hadoop cluster provision with Ambari 4|Demos of typical use cases Annexes 9
  • 10. © Huxi LI, 2018 “Enterprises like Hortonworks’ storage and compute processing, broad data ingestion, data governance, and open source support when deploying BDW.” - Forrester 2017 HDP = ENTERPRISE-GRADE HADOOP + 100% OPEN SOURCE Enterprise-grade Hadoop Hortonworks HDP is the only Enterprise-Grade Hadoop distribution of fully Open Source Apache License All of the technology built into the Hortonworks Data Platform is an Apache open source project 100% Open Source Hortonworks HDP is the only Enterprise-Grade Hadoop distribution of fully Open Source No Vendor Lock-In HDP delivers enterprise-grade software that fosters innovation and prevents vendor lock-in HDP
  • 11. © Huxi LI, 2018 HDP = CAREFULLY ENGINEERED SECURITY ARCHITECTURE • Platform built-in security and data governance. Built-In Security • Support integration into enterprise’s existing security system with support of industrial standards, such as Kerberos and LDAP. Corporate Integration • Centralized cluster management, monitoring, and access control. Centralized control • Simple yet powerful, Unix-like, resource permission control (HDFS) Simplicity
  • 12. © Huxi LI, 2018 HDP = LEADER OF BIG DATA WAREHOUSE • Hortonworks was ranked as a leader in The Forrester Wave™: Big Data Warehouse (Q2 2017). 12 “Hortonworks delivers actionable intelligence from all kinds of data-in-motion and data-at-rest. Through its open source strategy, Hortonworks continually evolves its offering by working closely with partners across the EDW ecosystem of tools and vendors. The vendor provides a cost-effective, nimble, and scalable architecture to implement big data warehouses, whether on-premises or in the cloud.” - Forrester 2017
  • 13. © Huxi LI, 2018 HORTONWORKS HDP 13
  • 14. © Huxi LI, 2018 VALUE PROPOSITIONS OF HORTONWORKS HDP Cost • No license fee • Pay for support Openness • Open source • With commercial friendly Apache license Expertise • Key contributors of Hadoop • Decade of industrial Big Data experience Security • Guarantee of a market leader • Mature technology, Support on premise, cloud, or appliance.
  • 15. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good for me ? 3|Hadoop Cluster Provision with Ambari 4|Demos of typical use cases Annexes 15
  • 16. © Huxi LI, 2018 AMBARI HADOOP CLUSTER PROVISION 1. Prepare VMs (manually, or with Ansible, Puppet, …) 2. Install Ambari server 3. Create Hadoop cluster with Ambari 4. Explore Hadoop cluster with Ambari
  • 17. © Huxi LI, 2018 TYPICAL HADOOP CLUSTERS • Hadoop backbone: The Hadoop backbone is typically composed of HDFS, MapReduce and YARN, i.e. the core of the Hadoop cluster: • HDFS for massive storage service • MapReduce for parallel computing over datasets • YARN for job scheduling and resource management • Edge nodes: Edge nodes are part of the Hadoop cluster, but they are typically connected to the rest of the corporate network, providing operational environments to end users. • Ambari is a typical edge node providing cluster management services • Sqoop and Flume are other kinds of edge nodes moving dataflows in and out of the cluster • Spark and Hive are edge nodes serving applications or business users 17 (Diagram: the corporate network reaches the cluster through edge nodes, e.g. Hive for applications and Ambari for administration and management)
  • 18. © Huxi LI, 2018 A PROOF-OF-CONCEPT HADOOP CLUSTER • POC Hadoop cluster: • Composed of 3 nodes • Shared VMs for edge and backbone roles • Hortonworks HDP 2.6.1.3 • AWS VPC 172.31.0.0/16 • Provisioner: • Ambari as provisioner of the cluster 18 VPC: 172.31.0.0/16 hdp-master (172.31.45.112) hdp-slave1 (172.31.37.152) hdp-slave2 (172.31.45.181) Namenode, Ambari, etc. Datanode, Hive, Storm, etc. Datanode, Nodemanager, etc.
  • 19. © Huxi LI, 2018 MACHINE SETUP • Provision VMs with AWS: • 3 EC2 instances • Red Hat Linux 7.0 • 1 VPC allowing traffic among nodes 19
  • 20. © Huxi LI, 2018 NETWORK TRAFFIC SETUP • Authorized traffic (AWS): 20 Cluster internal traffic
  • 21. © Huxi LI, 2018 INSTALL AMBARI SERVER • Follow the official installation guide for details 21
  • 22. © Huxi LI, 2018 SETUP HADOOP CLUSTER WITH AMBARI 22 Click Here Ambari Web UI: http://hdp-master.huxili.com:8080
  • 23. © Huxi LI, 2018 SETUP HADOOP CLUSTER WITH AMBARI 23 Choose Hadoop nodes
  • 24. © Huxi LI, 2018 EXPLORE HADOOP CLUSTER WITH AMBARI 24
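Beyond the web UI, Ambari also exposes a REST API on the same port, which is handy for scripted sanity checks of the cluster just created. Below is a minimal Java sketch, not from the original deck: the cluster name hdp_poc is a placeholder for whatever name is chosen in the creation wizard, and admin/admin are the default credentials used above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClusterCheck {
    public static void main(String[] args) throws Exception {
        // List the services installed on the cluster via Ambari's REST API.
        // "hdp_poc" is a placeholder cluster name.
        URL url = new URL("http://hdp-master.huxili.com:8080/api/v1/clusters/hdp_poc/services");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        con.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON list of installed services
            }
        }
    }
}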
  • 25. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good for me ? 3|Hadoop Cluster Provision with Ambari 4|Demos of typical use cases Annexes 25
  • 26. © Huxi LI, 2018 1. Data storage in HDFS … 26
  • 27. © Huxi LI, 2018 DEMO 1 - SAVING DATA IN HDFS • Open the Ambari UI at http://hdp-master.huxili.com • Open File View from the Ambari UI • Create a directory in HDFS: /user/huxili/dfs_demo • Upload files into the newly created directory • Manipulate the uploaded files (rename, set permissions; Ctrl+Mouse => select directory); the same operations can also be scripted, as sketched below
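For completeness, here is how the same operations look against the HDFS Java API instead of the File View. This is an illustrative sketch only: the NameNode URI (default RPC port 8020) and the local file name tweets.json are assumptions, not values from the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class DfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address, matching the POC master node in this deck.
        conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/huxili/dfs_demo");
        fs.mkdirs(dir);                                    // create the demo directory
        fs.copyFromLocalFile(new Path("tweets.json"),      // upload a local file (name assumed)
                             new Path(dir, "tweets.json"));
        fs.setPermission(dir, new FsPermission((short) 0755)); // set permissions
        fs.close();
    }
}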
  • 28. © Huxi LI, 2018 2. JSON Data query with Hive … 28
  • 29. © Huxi LI, 2018 DEMO 2 – DATA QUERY WITH HIVE • Check the sample Twitter data (JSON) in /user/huxili/twitter_data • Open Ambari Hive View 2.0 • Create a database for demo purposes -> create database demo; • Create an external table demo_tweets pointing to /user/huxili/twitter_data • Query the JSON data in Hive (a JDBC sketch of the same steps follows below)
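The same steps can also be driven from code through the HiveServer2 JDBC driver. A minimal sketch follows, under a few assumptions: Hive runs on hdp-slave1 with the default HiveServer2 port 10000, the hive-jdbc driver is on the classpath, and the OpenX SerDe jar is installed server-side (annex A3 below shows the stock JsonSerDe failing on this dataset, which is why OpenX is used here).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJsonQuery {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint based on this deck's node layout.
        String url = "jdbc:hive2://hdp-slave1.huxili.com:10000/default";
        try (Connection con = DriverManager.getConnection(url, "huxili", "");
             Statement st = con.createStatement()) {
            st.execute("CREATE DATABASE IF NOT EXISTS demo");
            // OpenX SerDe: requires json-serde-*.jar in Hive's lib (see annex A3).
            st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS demo.demo_tweets ("
                + "createddate string, geolocation string, tweetmessage string) "
                + "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' "
                + "LOCATION '/user/huxili/twitter_data/'");
            try (ResultSet rs = st.executeQuery(
                    "SELECT tweetmessage FROM demo.demo_tweets LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}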
  • 30. © Huxi LI, 2018 3. Data Streaming with Storm … 30
  • 31. © Huxi LI, 2018 DEMO 3 – STREAMING OF TWEETS WITH STORM 31 StreamingSpout -> Tweets Splitter Bolt -> Filter Bolt -> Counter Bolt -> Aggregation & HDFS writing Bolt -> Hadoop HDFS (File: /user/huxili/storm_out/counts.txt)
  • 32. © Huxi LI, 2018 DEMO 3 – DATA STREAMING WITH STORM • Open Ambari File View -> upload the example streaming jar • SSH into hdp-slave1 (the Storm host) • su to root and download the streaming example (hdfs dfs -get /user/huxili/storm_apps/tw*.jar) • Submit the streaming app to Storm • Streaming stops after 30 seconds; explore the results in Ambari
  • 33. © Huxi LI, 2018 Annexes 33
  • 34. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup • Provision VMs on AWS • Install required packages • Install Ambari Server A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm 34
  • 35. © Huxi LI, 2018 • Provision VMs: • Using AWS EC2 instances • Red Hat Linux 7.0 • AWS / VPC, 3 nodes A1 - AMBARI SERVER SETUP 35 MACHINES SETUP
  • 36. © Huxi LI, 2018 • Updating /etc/hosts (all nodes): • 172.31.45.112 hdp-master.huxili.com • 172.31.37.152 hdp-slave1.huxili.com • 172.31.45.181 hdp-slave2.huxili.com • Updating /etc/hostname (each node): • hdp-master.huxili.com (master) • hdp-slave1.huxili.com (slave1) • hdp-slave2.huxili.com (slave2) A1 - AMBARI SERVER SETUP 36 MACHINES SETUP
  • 37. © Huxi LI, 2018 • Authorized traffic (AWS): A1 - AMBARI SERVER SETUP 37 NETWORK TRAFFIC SETUP Cluster internal traffic
  • 38. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup • Provision VMs on AWS • Install required packages • Install Ambari Server A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm 38
  • 39. © Huxi LI, 2018 • hdp-master.huxili.com: • sudo yum install -y ntp • sudo systemctl enable ntpd • sudo yum install -y postgresql-jdbc • sudo yum install -y wget • sudo wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.1.3/ambari.repo -O /etc/yum.repos.d/ambari.repo • sudo yum install -y ambari-server A1 - AMBARI SERVER SETUP 39 INSTALL REQUIRED PACKAGES
  • 40. © Huxi LI, 2018 • hdp-slave1.huxili.com & hdp-slave2.huxili.com: • sudo yum install -y ntp • sudo systemctl enable ntpd A1 - AMBARI SERVER SETUP 40 INSTALL REQUIRED PACKAGES
  • 41. © Huxi LI, 2018 A1 - AMBARI SERVER SETUP 41 INSTALL REQUIRED PACKAGES • hdp-slave1.huxili.com: Install hive metastore DB (postgres)
  • 42. © Huxi LI, 2018 A1 - AMBARI SERVER SETUP 42 INSTALL REQUIRED PACKAGES • hdp-slave1.huxili.com: Check Hive DB
  • 43. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup • Provision VMs on AWS • Install required packages • Install Ambari Server A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm 43
  • 44. © Huxi LI, 2018 • Commands to run: A1 - AMBARI SERVER SETUP 44 INSTALL SERVER
  • 45. © Huxi LI, 2018 • Configure JDBC driver: • sudo ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar A1 - AMBARI SERVER SETUP 45 INSTALL SERVER
  • 46. © Huxi LI, 2018 • Setup server: • sudo ambari-server setup A1 - AMBARI SERVER SETUP 46 INSTALL SERVER
  • 47. © Huxi LI, 2018 • Setup server (continued): A1 - AMBARI SERVER SETUP 47 INSTALL SERVER
  • 48. © Huxi LI, 2018 • Start server: A1 - AMBARI SERVER SETUP 48 INSTALL SERVER
  • 49. © Huxi LI, 2018 • Check services binding: A1 - AMBARI SERVER SETUP 49 INSTALL SERVER
  • 50. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup A2|Hadoop Cluster Setup • Install Hadoop components • Check cluster sanity A3|Streaming Architecture Hive/Storm 50
  • 51. © Huxi LI, 2018 • Open Ambari management UI: • http://hdp-master.huxili.com:8080 • User: admin • Pwd: admin A2 - HADOOP CLUSTER SETUP 51 STEP 1 – OPEN MANAGEMENT UI
  • 52. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 52 STEP 2 – START CREATION WIZARD Click Here
  • 53. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 53 STEP 3 – CHOOSE CLUSTER NAME
  • 54. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 54 STEP 4 – CHOOSE VERSION
  • 55. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 55 STEP 5 – CHOOSE PARAMETERS
  • 56. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 56 STEP 6 – AMBARI HOST CHECK (FAILED) Error
  • 57. © Huxi LI, 2018 • vi /etc/hostname on each node and set the right hostname: • hdp-master.huxili.com • hdp-slave1.huxili.com • hdp-slave2.huxili.com A2 - HADOOP CLUSTER SETUP 57 STEP 6 – UPDATE HOSTNAMES AND RETRY
  • 58. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 58 STEP 7 – CHOOSE REQUIRED SERVICES
  • 59. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 59 STEP 8 – DISTRIBUTION OF COMPONENTS
  • 60. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 60 STEP 8 – DISTRIBUTION OF COMPONENTS
  • 61. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 61 STEP 9 – CONFIGURE HIVE DB (POSTGRES)
  • 62. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 62 STEP 10 – CONTINUED
  • 63. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 63 STEP 11 – CONTINUED
  • 64. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 64 STEP 12 – CONTINUED
  • 65. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 65 STEP 13 – CONTINUED
  • 66. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 66 STEP 14 – INSTALLATION FAILED
  • 67. © Huxi LI, 2018 • Error messages : • Solution : manually install the missing package • sudo yum-config-manager --enable rhui-REGION-rhel-server-optional • sudo yum install libtirpc-devel A2 - HADOOP CLUSTER SETUP 67 STEP 15 – ANALYZE ERROR MESSAGE Execution of '/usr/bin/yum -d 0 -e 0 -y install hadoop_2_6_4_0_91-client' returned 1. Error: Package: hadoop_2_6_4_0_91-hdfs-2.7.3.2.6.4.0-91.x86_64 (HDP-2.6-repo-1) Requires: libtirpc-devel You could try using --skip-broken to work around the problem You could try running: rpm -Va --nofiles --nodigest
  • 68. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 68 STEP 16 – INSTALL MISSING PACKAGES
  • 69. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 69 STEP 17 – INSTALL SUCCESS
  • 70. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 70 STEP 18 – AMBARI DASHBOARD
  • 71. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 71 STEP 19 – CREATE USER ‘HUXILI’
  • 72. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup A2|Hadoop Cluster Setup • Install Hadoop components • Check cluster sanity A3|Streaming Architecture Hive/Storm 72
  • 73. © Huxi LI, 2018 • Open HDFS File View failed: A2 - HADOOP CLUSTER SETUP 73 CHECK FILE VIEW
  • 74. © Huxi LI, 2018 • Check view definitions (cluster name is empty): A2 - HADOOP CLUSTER SETUP 74 CHECK FILE VIEW
  • 75. © Huxi LI, 2018 • HDFS File View after filling cluster name: A2 - HADOOP CLUSTER SETUP 75 CHECK FILE VIEW
  • 76. © Huxi LI, 2018 • Open Hive View failed: • Missing ‘/user/huxili’ A2 - HADOOP CLUSTER SETUP 76 CHECK HIVE VIEW
  • 77. © Huxi LI, 2018 • Create directory for custom Hive user: • hdfs dfs -ls / && hdfs dfs -ls /user A2 - HADOOP CLUSTER SETUP 77 CHECK HIVE VIEW
  • 78. © Huxi LI, 2018 • Check HDFS: • hdfs dfs -getfacl /user • Create /user/huxili and /user/admin • su hdfs • hdfs dfs -mkdir /user/huxili && hdfs dfs -chown huxili /user/huxili • hdfs dfs -mkdir /user/admin && hdfs dfs -chown admin /user/admin A2 - HADOOP CLUSTER SETUP 78 CHECK HIVE VIEW
  • 79. © Huxi LI, 2018 • Open Hive View OK after creation of missing directory: A2 - HADOOP CLUSTER SETUP 79 CHECK HIVE VIEW
  • 80. © Huxi LI, 2018 • Open Hive View 2.0 (OK) : A2 - HADOOP CLUSTER SETUP 80 CHECK HIVE VIEW 2.0
  • 81. © Huxi LI, 2018 • Check service binding: • netstat -tulpn A2 - HADOOP CLUSTER SETUP 81 CHECK SERVICE BINDING
  • 82. © Huxi LI, 2018 • Service binding problems: • ZooKeeper listens only on IPv6 and the cluster join fails : • Log error messages: • (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server hdp-slave1.huxili.coms/... Will not attempt to authenticate using SASL (unknown error) • Solution: • Force ZooKeeper to use IPv4 by adding the following line in “Advanced zookeeper-env => template”: ‐ export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djava.net.preferIPv4Stack=true" A2 - HADOOP CLUSTER SETUP 82 CHECK SERVICE BINDING
  • 83. © Huxi LI, 2018 • Force IPv4 for the following services: • ambari-infra-solr : • SOLR_OPTS="$SOLR_OPTS -Djava.net.preferIPv4Stack=true" • SmartSense: • export ANALYZER_JAVA_OPTS="{{analyzer_jvm_opts}} -Djava.net.preferIPv4Stack=true -Xmx{{analyzer_jvm_heap}}m" • export ZEPPELIN_JAVA_OPTS="-Dhdp.version={{hdp_version}} -Dlog.file.name=activity-explorer.log -DSmartSenseActivityExplorer -Djava.net.preferIPv4Stack=true" • YARN: • YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true" • STORM: • DRPC: -Xmx768m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true • Nimbus: -Xmx1024m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true • Supervisor & UI & Storm-site/logviewer: -Djava.net.preferIPv4Stack=true A2 - HADOOP CLUSTER SETUP 83 CHECK SERVICE BINDING
  • 84. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 84 SERVICE INSTALL SUCCESSFUL !
  • 85. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm • Store twitter data in HDFS • Hive Query of Twitter data • Twitter streaming with Storm 85
  • 86. © Huxi LI, 2018 • Create account huxili using Ambari • Create directories in HDFS : • Log in as hdfs (su hdfs) • Create /user/huxili • Change its owner to huxili • Now log in to Ambari as user ‘huxili’ and open ‘Files View’ from the menu. Since huxili is the owner of the newly created directory, we can add data into this directory. I create the following directory to save Twitter data: • /user/huxili/twitter_data A3 - STREAMING ARCHITECTURE HIVE/STORM 86 STORE TWITTER DATA
  • 87. © Huxi LI, 2018 • Open Hive View 2.0 and execute: • CREATE DATABASE `huxili`; • Create table 'tweets' : SET hive.support.sql11.reserved.keywords=false; DROP TABLE IF EXISTS tweets; CREATE EXTERNAL TABLE tweets (createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/user/huxili/twitter_data/'; A3 - STREAMING ARCHITECTURE HIVE/STORM 87 HIVE QUERY OF TWITTER DATA – CREATE TABLE
  • 88. © Huxi LI, 2018 • Create Table ‘tweets’ OK: A3 - STREAMING ARCHITECTURE HIVE/STORM 88 HIVE QUERY OF TWITTER DATA – CREATE TABLE CONTINUED
  • 89. © Huxi LI, 2018 • SET hive.support.sql11.reserved.keywords=false; SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name = 'Aimee_Cottle'; A3 - STREAMING ARCHITECTURE HIVE/STORM 89 HIVE QUERY OF TWITTER DATA – QUERY DATA Failed with error messages
  • 90. © Huxi LI, 2018 • Using another parser: • curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-serde-1.3.8-jar-with-dependencies.jar -O • curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-udf-1.3.8-jar-with-dependencies.jar -O • cp *.jar /usr/hdp/2.6.4.0-91/hive/lib/ && cp *.jar /usr/hdp/2.6.4.0-91/hive2/lib/ • Restart Hive (Ambari) • In Hive View 2.0, execute: • ADD JAR /usr/hdp/2.6.4.0-91/hive2/lib/json-serde-1.3.8-jar-with-dependencies.jar; A3 - STREAMING ARCHITECTURE HIVE/STORM 90 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
  • 91. © Huxi LI, 2018 • Create table 'tweets2' using OpenX: A3 - STREAMING ARCHITECTURE HIVE/STORM 91 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER CREATE EXTERNAL TABLE tweets2 (createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION '/user/huxili/twitter_data/';
  • 92. © Huxi LI, 2018 • Query twitter data on the new table: • SELECT * FROM tweets2 LIMIT 100; • It works ! A3 - STREAMING ARCHITECTURE HIVE/STORM 92 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
  • 93. © Huxi LI, 2018 • Query twitter data on the new table: • SET hive.support.sql11.reserved.keywords=false; SELECT tweetmessage, user.name, createddate FROM tweets2 WHERE user.screenname = 'Aimee_Cottle'; A3 - STREAMING ARCHITECTURE HIVE/STORM 93 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER OK
  • 94. © Huxi LI, 2018 • Objective: • In this example, I will use Storm and twitter4j to analyze real-time tweets, then save the analysis results to HDFS on the Hadoop cluster. • Components: • TwitterSampleSpout – a Storm spout responsible for reading the Twitter stream using twitter4j (http://twitter4j.org). • WordSplitterBolt – a Storm bolt responsible for splitting tweets into words for analysis. • IgnoreWordsBolt – a Storm bolt responsible for filtering out unwanted words. • WordCounterBolt – a Storm bolt responsible for word-frequency analysis. • HdfsWriterBolt – a Storm bolt responsible for writing the analysis results to HDFS. A3 - STREAMING ARCHITECTURE HIVE/STORM 94 STREAMING WITH STORM – TWITTER EXAMPLE
  • 95. © Huxi LI, 2018 • Setup of Twitter streaming (OAuth): A3 - STREAMING ARCHITECTURE HIVE/STORM 95 STREAMING WITH STORM – TWITTER EXAMPLE
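The OAuth setup is shown here only as a screenshot. As a rough sketch, twitter4j credentials can alternatively be supplied in code through its ConfigurationBuilder; the four key values below are placeholders to be replaced with the tokens generated on the Twitter developer portal.

import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterAuthDemo {
    public static TwitterStream buildStream() {
        // Placeholder credentials; real values come from the Twitter developer portal.
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.setOAuthConsumerKey("<consumer-key>")
          .setOAuthConsumerSecret("<consumer-secret>")
          .setOAuthAccessToken("<access-token>")
          .setOAuthAccessTokenSecret("<access-token-secret>");
        return new TwitterStreamFactory(cb.build()).getInstance();
    }
}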
  • 96. © Huxi LI, 2018 • STORM Topology: A3 - STREAMING ARCHITECTURE HIVE/STORM 96 STREAMING WITH STORM – TWITTER EXAMPLE StreamingSpout -> Tweets Splitter Bolt -> Filter Bolt -> Counter Bolt -> Aggregation & HDFS writing Bolt -> Hadoop HDFS
  • 97. © Huxi LI, 2018 • Java Components: • AppContext: • Configuration of application. • HdfsWriterBolt: • Aggregation & Output to HDFS. • TwitterSampleSpout: • Streaming of tweets using twitter4j. • WordSplitterBolt, IgnoreWordsBolt, WordCounterBolt : • Splitting, filtering, and counting bolts. • Resources: • Configuration files for twitter4j and OAUTH authentication. A3 - STREAMING ARCHITECTURE HIVE/STORM 97 STREAMING WITH STORM – TWITTER EXAMPLE
  • 98. © Huxi LI, 2018 • STORM Topology (Java): A3 - STREAMING ARCHITECTURE HIVE/STORM 98 STREAMING WITH STORM – TWITTER EXAMPLE
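The topology code itself appears only as a screenshot, so the following is a hedged reconstruction of how the five components listed above could be wired together with Storm's TopologyBuilder. Only the class names come from the deck; the stream ids, the field name "word", and the groupings are assumptions.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class Topology {
    public static void main(String[] args) throws Exception {
        // Spout and bolt classes are the project classes listed above
        // (same package assumed).
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TwitterSampleSpout());
        builder.setBolt("split", new WordSplitterBolt()).shuffleGrouping("tweets");
        builder.setBolt("filter", new IgnoreWordsBolt()).shuffleGrouping("split");
        builder.setBolt("count", new WordCounterBolt())
               .fieldsGrouping("filter", new Fields("word")); // field name assumed
        builder.setBolt("hdfs", new HdfsWriterBolt()).globalGrouping("count");

        Config conf = new Config();
        StormSubmitter.submitTopology("twitter-stream-example", conf,
                                      builder.createTopology());
    }
}

Once packaged, this class is what the storm jar command in the execution procedure below submits to the cluster.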
  • 99. © Huxi LI, 2018 • Writing to HDFS (Java): A3 - STREAMING ARCHITECTURE HIVE/STORM 99 STREAMING WITH STORM – TWITTER EXAMPLE
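The HDFS writer is likewise shown only as a screenshot. The sketch below illustrates one plausible shape for HdfsWriterBolt: accumulate word counts in memory and flush them to /user/huxili/storm_out/counts.txt when the topology shuts down. The NameNode URI and field names are assumptions, and error handling is kept minimal.

import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class HdfsWriterBolt extends BaseRichBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) { }

    @Override
    public void execute(Tuple tuple) {
        // Keep the latest count seen for each word (field names assumed).
        counts.put(tuple.getStringByField("word"),
                   tuple.getIntegerByField("count"));
    }

    @Override
    public void cleanup() {
        // Dump the aggregated counts to HDFS on topology shutdown.
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NN
            FileSystem fs = FileSystem.get(conf);
            try (PrintWriter out = new PrintWriter(
                    fs.create(new Path("/user/huxili/storm_out/counts.txt"), true))) {
                counts.forEach((word, n) -> out.println(word + "\t" + n));
            }
            fs.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { } // terminal bolt
}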
  • 100. © Huxi LI, 2018 • Packaging and execution: • Built with Maven • Using the Shade plugin to produce a self-contained jar A3 - STREAMING ARCHITECTURE HIVE/STORM 100 STREAMING WITH STORM – TWITTER EXAMPLE
  • 101. © Huxi LI, 2018 • Procedure of execution: 1. Create the shaded jar (mvn clean install -Pcluster) 2. Upload the generated application to HDFS using the Ambari File View 3. Log in to hdp-slave1 (Storm is installed on slave1) 4. Perform the following operations: ‐ sudo su ‐ hdfs dfs -get /user/huxili/storm_apps/*.jar ‐ storm jar twitter-stream-example.jar com.huxili.storm.twitter.example.Topology 5. Observe the execution (see logs) 6. When the execution terminates, check the output file saved in HDFS (/user/huxili/storm_out/counts.txt) A3 - STREAMING ARCHITECTURE HIVE/STORM 101 STREAMING WITH STORM – TWITTER EXAMPLE
  • 102. © Huxi LI, 2018 • Example execution logs: A3 - STREAMING ARCHITECTURE HIVE/STORM 102 STREAMING WITH STORM – TWITTER EXAMPLE
  • 103. © Huxi LI, 2018 • Example execution logs: A3 - STREAMING ARCHITECTURE HIVE/STORM 103 STREAMING WITH STORM – TWITTER EXAMPLE
  • 104. © Huxi LI, 2018 • Example results: A3 - STREAMING ARCHITECTURE HIVE/STORM 104 STREAMING WITH STORM – TWITTER EXAMPLE
  • 105. © Huxi LI, 2018 Progress. Together. 105