© Huxi LI, 2018
HADOOP BIG DATA PLATFORM
Hortonworks HDP, Is it good enough ?
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good enough ?
3|Hadoop cluster provision with Ambari
4|Demos of typical use cases
Annexes
2
© Huxi LI, 2018
Digitalization
INDUSTRIAL TRENDS
• Digitalization of core business: Going digital is now a core strategy
for many organizations around the world, including the core business of
finance and insurance. Most successful enterprises deliver their core
business through digital channels.
• Digitalization of society: Social networks, collaborative platforms, e-
government, and smart devices are changing society as a whole.
- Social networks are everywhere and are starting to impact business models;
- Industrialized countries are pushing the digitalization of their entire public
services as a way of modernizing government and better serving their citizens.
• Internet of Things (IoT): The internet is expanding to physical objects;
experts estimate that the IoT will consist of about 30 billion objects by 2020
(IEEE, 2016).
• Artificial Intelligence (AI): AI has reached a new level of maturity,
moving from the academic, experimental stage into an era of massive practical
use. AI not only needs massive data but also produces massive data.
3
A matter of Data
© Huxi LI, 2018
AGE OF BIG DATA
4
Big Volume
High Variety
Uncertain Veracity
High Velocity
In 2016, global mobile data traffic
amounted to 7 exabytes per month. In
2021, mobile data traffic worldwide is
expected to reach 49 exabytes per month
at a compound annual growth rate of 47
percent (Statista, 2018)
Data comes in many varieties: structured,
semi-structured, and, increasingly, unstructured.
Data is produced at varying velocities, from
slow batch processing and near-real-time
streaming to real-time IoT sensors.
Veracity is one of the unfortunate
characteristics of today’s data. Not all data
are trustworthy.
© Huxi LI, 2018
HADOOP = DESIGNED FOR BIG DATA
Hadoop allows for the distributed processing of large data sets across clusters
of computers using simple programming models. It is designed to scale up to
thousands of machines, each offering local computation and storage.
• Massive Storage (Volume): Hadoop’s distributed file system (HDFS) is
specially designed to store massive data on commodity hardware, providing
high-throughput access to application data.
• Any Kind of Data (Variety): You can save text, images, video, XML, JSON, and
any other type of data in HDFS, as if you were working on a Unix-like file system;
this is crucial given the variety of today’s data.
• Massively Scalable Computation: Parallel computing is integrated into the
Hadoop core: YARN for cluster resource management and
MapReduce for parallel processing of large data sets. Many other derived
tools, such as Storm and Hive, build on this foundation (a minimal MapReduce
sketch follows).
5
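The deck itself shows no MapReduce code, so here is the classic word-count job in Java as a concrete illustration of the YARN/MapReduce bullet above. This is a minimal sketch, not part of the original deck; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token of every input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

YARN schedules the map and reduce tasks across the cluster; the same pattern scales from three nodes to thousands.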
© Huxi LI, 2018
HADOOP = A MATURE SYSTEM
• Production-proven: Hadoop has proven
itself in production at many
companies around the world.
• Available on Cloud: Hadoop is now
offered on all major cloud platforms,
such as Microsoft Azure, Amazon AWS,
IBM Cloud, and Google GCP.
6
© Huxi LI, 2018
HADOOP = A RICH ECOSYSTEM
The term Hadoop often refers not only to the core system but to the entire ecosystem of
software packages that can be installed on top of or alongside it.
Many choices:
• Ingestion: Sqoop, Flume, Kafka, NiFi, Storm, Flink, …
• Streaming & Compute: MapReduce, Storm, Spark, Flink, NiFi, …
• Analysis/SQL-on-Hadoop: Impala, Hive, Drill, …
• Machine learning: Mahout, Spark ML, …
• NoSQL: HBase, Cassandra, …
• Management: Ambari, HUE, …
7
© Huxi LI, 2018
VALUE PROPOSITIONS OF HADOOP
8
Designed for Big Data
• Massively scalable storage and computation
• Carefully designed for large-scale deployment
High Maturity
• Production-proven, highly robust
• Validated on big clusters with thousands of nodes
Rich Ecosystem
• An ecosystem has formed around Hadoop, offering many choices with enterprise-grade quality, freely available
• Wide support, both commercial and community-based
High Scalability
• Technically, a single Hadoop cluster can be scaled to thousands of nodes
• Functionally, Hadoop supports any kind of data and scales to a wide range of use cases
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good enough ?
3|Hadoop cluster provision with Ambari
4|Demos of typical use cases
Annexes
9
© Huxi LI, 2018
“Enterprises like Hortonworks’ storage and
compute processing, broad data ingestion,
data governance, and open source support
when deploying BDW.” - Forrester 2017
HDP = ENTERPRISE-GRADE HADOOP + 100% OPEN SOURCE
Enterprise-grade Hadoop
Hortonworks HDP is the only
enterprise-grade Hadoop
distribution that is fully open source
Apache License
All of the technology built into
the Hortonworks Data Platform
is an Apache open source
project
100% Open Source
Hortonworks HDP is the only
enterprise-grade Hadoop
distribution that is fully open source
No Vendor Lock-In
HDP delivers enterprise-grade
software that fosters
innovation and prevents vendor
lock-in
HDP
© Huxi LI, 2018
HDP = CAREFULLY ENGINEERED SECURITY ARCHITECTURE
Built-In Security
• Platform built-in security and data governance.
Corporate Integration
• Supports integration with the enterprise’s existing security systems via industry
standards such as Kerberos and LDAP.
Centralized Control
• Centralized cluster management, monitoring, and access control.
Simplicity
• Simple yet powerful, Unix-like resource permission control in HDFS (see the sketch below)
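To illustrate the Unix-like permission model, here is a minimal sketch using the Hadoop FileSystem Java API. It is not from the deck; the NameNode address and the directory name are assumptions based on the POC cluster described later, and changing ownership requires running as an HDFS superuser.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path dir = new Path("/user/huxili/secure_data"); // hypothetical directory
      fs.mkdirs(dir);
      // Equivalent of chmod 740: owner rwx, group r--, others ---
      fs.setPermission(dir, new FsPermission(FsAction.ALL, FsAction.READ, FsAction.NONE));
      // Equivalent of chown huxili:hadoop (needs HDFS superuser rights, e.g. the hdfs user)
      fs.setOwner(dir, "huxili", "hadoop");
    }
  }
}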
© Huxi LI, 2018
HDP = A LEADER IN BIG DATA WAREHOUSE
• Hortonworks was ranked as a leader in The Forrester Wave™:
Big Data Warehouse (Q2 2017).
12
“Hortonworks delivers actionable intelligence from all kinds of
data-in-motion and data-at-rest. Through its open source strategy,
Hortonworks continually evolves its offering by working closely
with partners across the EDW ecosystem of tools and vendors. The
vendor provides a cost-effective, nimble, and scalable architecture
to implement big data warehouses, whether on-premises or in the
cloud.” - Forrester 2017
© Huxi LI, 2018
HORTONWORKS HDP
13
© Huxi LI, 2018
VALUE PROPOSITIONS OF HORTONWORKS HDP
Cost
• No license fee
• Pay only for support
Openness
• Open source
• With the commercially friendly Apache license
Expertise
• Key contributors to Hadoop
• A decade of industrial Big Data experience
Security
• The backing of a market leader
• Mature technology; supported on premise, in the cloud, or on an appliance
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good for me ?
3|Hadoop Cluster Provision with Ambari
4|Demos of typical use cases
Annexes
15
© Huxi LI, 2018
AMBARI HADOOP CLUSTER PROVISION
1|Prepare VMs (manual, Ansible, Puppet, …)
2|Install Ambari server
3|Create Hadoop cluster with Ambari
4|Explore Hadoop cluster with Ambari
© Huxi LI, 2018
TYPICAL HADOOP CLUSTERS
• Hadoop backbone: The Hadoop backbone is typically
composed of HDFS, MapReduce, and YARN, i.e.
the core of the Hadoop cluster:
• HDFS for massive storage
• MapReduce for parallel computation over datasets
• YARN for job scheduling and resource management
• Edge nodes: Edge nodes are part of the Hadoop
cluster, but they are typically connected to the rest
of the corporate network, providing operational
environments to end users.
• Ambari is a typical edge-node service providing cluster management
• Sqoop & Flume are another kind of edge node, moving data into or out of the cluster
• Spark & Hive are edge nodes serving applications or business users
17
Corporate Network
Administration Applications
Edge nodes
(e.g. Hive)
Management
(e.g. Ambari)
© Huxi LI, 2018
A PROOF-OF-CONCEPT HADOOP CLUSTER
• POC Hadoop cluster:
• Composed of 3 nodes
• VMs shared between edge and backbone roles
• Hortonworks HDP 2.6.1.3
• AWS VPC 172.31.0.0/16
• Provisioner:
• Ambari as provisioner of the cluster
18
VPC: 172.31.0.0/16
hdp-master (172.31.45.112)
hdp-slave1
(172.31.37.152)
hdp-slave2
(172.31.45.181)
Namenode,
Ambari, etc.
Datanode,
Hive, Storm,
etc.
Datanode,
Nodemanager,
etc.
© Huxi LI, 2018
MACHINE SETUP
• Provision VMs with AWS:
• 3 EC2 instances
• Red Hat 7.0 Linux
• 1 VPC allowing traffic among nodes
19
© Huxi LI, 2018
NETWORK TRAFFIC SETUP
• Authorized traffic (AWS):
20
Cluster internal traffic
© Huxi LI, 2018
INSTALL AMBARI SERVER
• Follow the official installation guide for details
21
© Huxi LI, 2018
SETUP HADOOP CLUSTER WITH AMBARI
22
Click Here
Ambari Web UI: http://hdp-master.huxili.com:8080
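Everything done through this web UI can also be driven through Ambari’s REST API on the same port. Below is a minimal sketch, not from the deck, assuming Java 11+ and the default admin/admin credentials shown later in the annex.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariClusterCheck {
  public static void main(String[] args) throws Exception {
    // Basic auth with the default Ambari credentials (admin/admin) used in this POC
    String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("http://hdp-master.huxili.com:8080/api/v1/clusters"))
        .header("Authorization", "Basic " + auth)
        .GET()
        .build();
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println(resp.body()); // JSON description of the clusters this Ambari server manages
  }
}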
© Huxi LI, 2018
SETUP HADOOP CLUSTER WITH AMBARI
23
Choose Hadoop nodes
© Huxi LI, 2018
EXPLORE HADOOP CLUSTER WITH AMBARI
24
© Huxi LI, 2018
AGENDA
1|Why Hadoop ?
2|Hortonworks HDP, is it good for me ?
3|Hadoop Cluster Provision with Ambari
4|Demos of typical use cases
Annexes
25
© Huxi LI, 2018
26
1. Data storage in HDFS …
© Huxi LI, 2018
DEMO 1 - SAVING DATA IN HDFS
• Open the Ambari UI at http://hdp-master.huxili.com
• Open File View from the Ambari UI
• Create a directory in HDFS: /user/huxili/dfs_demo
• Upload files into the newly created directory
• Manipulate the uploaded files: rename, set permissions (Ctrl+Mouse => select a directory); a programmatic alternative follows
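The same demo can be scripted instead of clicking through the File View. A minimal sketch with the Hadoop FileSystem API, assuming the NameNode address of this POC cluster; the local file name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsDemoUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path target = new Path("/user/huxili/dfs_demo");
      fs.mkdirs(target); // create the demo directory
      // Copy a local file into HDFS (sample.txt is a placeholder name)
      fs.copyFromLocalFile(new Path("sample.txt"), target);
      // Rename it inside HDFS, as done via the File View in the demo
      fs.rename(new Path(target, "sample.txt"), new Path(target, "sample_renamed.txt"));
    }
  }
}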
© Huxi LI, 2018
28
2. JSON Data query with Hive …
© Huxi LI, 2018
DEMO 2 – DATA QUERY WITH HIVE
• Check the sample Twitter data (JSON) in /user/huxili/twitter_data
• Open Ambari Hive View 2.0
• Create a database for the demo -> create database demo;
• Create an external table demo_tweets pointing to /user/huxili/twitter_data
• Query the JSON data in Hive; a JDBC sketch follows
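The same query can also be run programmatically over JDBC against HiveServer2. A minimal sketch, assuming HiveServer2 runs on hdp-slave1 (where Hive is installed in this POC) on its default port 10000, and that demo_tweets exposes a tweetmessage column as in the annex; the hive-jdbc driver jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host and port are assumptions for this POC
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://hdp-slave1.huxili.com:10000/demo";
    try (Connection conn = DriverManager.getConnection(url, "huxili", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT tweetmessage FROM demo_tweets LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // print each tweet message
      }
    }
  }
}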
© Huxi LI, 2018
30
3. Data Streaming with Storm …
© Huxi LI, 2018
DEMO 3 – STREAMING OF TWEETS WITH STORM
31
StreamingSpout → Tweets Splitter Bolt → Filter Bolt → Counter Bolt → Aggregation & HDFS Writing Bolt → Hadoop HDFS
(File: /user/huxili/storm_out/counts.txt)
© Huxi LI, 2018
DEMO 3 – DATA STREAMING WITH STORM
• Open Ambari File View -> upload the example streaming jar
• SSH into hdp-slave1 (the Storm host)
• su root, then download the streaming example (hdfs dfs -get /user/huxili/storm_apps/tw*.jar)
• Submit the streaming application to Storm
• The streaming stops after 30 seconds; explore the results in Ambari
© Huxi LI, 2018
33
Annexes
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
• Provision VMs on AWS
• Install required packages
• Install Ambari Server
A2|Hadoop Cluster Setup
A3|Streaming Architecture Hive/Storm
34
© Huxi LI, 2018
• Provision VMs:
• Using AWS EC2 instances
• Red Hat 7.0 Linux
• AWS VPC, 3 nodes
A1 - AMBARI SERVER SETUP
35
MACHINES SETUP
© Huxi LI, 2018
• Updating /etc/hosts (All nodes):
• 172.31.45.112 hdp-master.huxili.com
• 172.31.37.152 hdp-slave1.huxili.com
• 172.31.45.181 hdp-slave2.huxili.com
• Updating /etc/hostname (All nodes):
• hdp-master.huxili.com (Master)
• hdp-slave1.huxili.com (slave1)
• hdp-slave2.huxili.com (slave2)
A1 - AMBARI SERVER SETUP
36
MACHINES SETUP
© Huxi LI, 2018
• Authorized traffic (AWS):
A1 - AMBARI SERVER SETUP
37
NETWORK TRAFFIC SETUP
Cluster internal traffic
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
• Provision VMs on AWS
• Install required packages
• Install Ambari Server
A2|Hadoop Cluster Setup
A3|Streaming Architecture Hive/Storm
38
© Huxi LI, 2018
• hdp-master.huxili.com:
• sudo yum install -y ntp
• sudo systemctl enable ntpd
• sudo yum install -y postgresql-jdbc
• sudo yum install -y wget
• sudo wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.1.3/ambari.repo -O
/etc/yum.repos.d/ambari.repo
• sudo yum install -y ambari-server
A1 - AMBARI SERVER SETUP
39
INSTALL REQUIRED PACKAGES
© Huxi LI, 2018
• hdp-slave1.huxili.com & hdp-slave2.huxili.com:
• sudo yum install -y ntp
• sudo systemctl enable ntpd
A1 - AMBARI SERVER SETUP
40
INSTALL REQUIRED PACKAGES
© Huxi LI, 2018
A1 - AMBARI SERVER SETUP
41
INSTALL REQUIRED PACKAGES
• hdp-slave1.huxili.com: Install hive metastore DB (postgres)
© Huxi LI, 2018
A1 - AMBARI SERVER SETUP
42
INSTALL REQUIRED PACKAGES
• hdp-slave1.huxili.com: Check Hive DB
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
• Provision VMs on AWS
• Install required packages
• Install Ambari Server
A2|Setup Hadoop Cluster
A3|Streaming Architecture Hive/Storm
43
© Huxi LI, 2018
• Commands to run:
A1 - AMBARI SERVER SETUP
44
INSTALL SERVER
© Huxi LI, 2018
• Configure JDBC driver:
• sudo ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar
A1 - AMBARI SERVER SETUP
45
INSTALL SERVER
© Huxi LI, 2018
• Setup server:
• sudo ambari-server setup
A1 - AMBARI SERVER SETUP
46
INSTALL SERVER
© Huxi LI, 2018
• Setup server (continued):
A1 - AMBARI SERVER SETUP
47
INSTALL SERVER
© Huxi LI, 2018
• Start server:
A1 - AMBARI SERVER SETUP
48
INSTALL SERVER
© Huxi LI, 2018
• Check services binding:
A1 - AMBARI SERVER SETUP
49
INSTALL SERVER
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
A2|Hadoop Cluster Setup
• Install Hadoop components
• Check cluster sanity
A3|Streaming Architecture Hive/Storm
50
© Huxi LI, 2018
• Open Ambari management UI:
• http://hdp-master.huxili.com:8080
• User: admin
• Pwd: admin
A2 - HADOOP CLUSTER SETUP
51
STEP 1 – OPEN MANAGEMENT UI
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
52
STEP 2 – START CREATION WIZARD
Click Here
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
53
STEP 3 – CHOOSE CLUSTER NAME
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
54
STEP 4 – CHOOSE VERSION
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
55
STEP 5 – CHOOSE PARAMETERS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
56
STEP 6 – AMBARI HOST CHECK (FAILED)
Error
© Huxi LI, 2018
• vi /etc/hostname and set the right hostname
• hdp-master.huxili.com
• hdp-slave1.huxili.com
• hdp-slave2.huxili.com
A2 - HADOOP CLUSTER SETUP
57
STEP 6 – UPDATE HOSTNAMES AND RETRY
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
58
STEP 7 – CHOOSE REQUIRED SERVICES
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
59
STEP 8 – DISTRIBUTION OF COMPONENTS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
60
STEP 8 – DISTRIBUTION OF COMPONENTS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
61
STEP 9 – CONFIGURE HIVE DB (POSTGRES)
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
62
STEP 10 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
63
STEP 11 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
64
STEP 12 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
65
STEP 13 – CONTINUED
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
66
STEP 14 – INSTALLATION FAILED
© Huxi LI, 2018
• Error messages:
• Solution: manually install the missing package
• sudo yum-config-manager --enable rhui-REGION-rhel-server-optional
• sudo yum install libtirpc-devel
A2 - HADOOP CLUSTER SETUP
67
STEP 15 – ANALYZE ERROR MESSAGE
Execution of '/usr/bin/yum -d 0 -e 0 -y install hadoop_2_6_4_0_91-client' returned 1. Error:
Package: hadoop_2_6_4_0_91-hdfs-2.7.3.2.6.4.0-91.x86_64 (HDP-2.6-repo-1)
Requires: libtirpc-devel
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
68
STEP 16 – INSTALL MISSING PACKAGES
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
69
STEP 17 – INSTALL SUCCESS
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
70
STEP 18 – AMBARI DASHBOARD
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
71
STEP 19 – CREATE USER ‘HUXILI’
© Huxi LI, 2018
ANNEXES
A1|Ambari Server Setup
A2|Hadoop Cluster Setup
• Install Hadoop components
• Check cluster sanity
A3|Streaming Architecture Hive/Storm
72
© Huxi LI, 2018
• Open HDFS File View failed:
A2 - HADOOP CLUSTER SETUP
73
CHECK FILE VIEW
© Huxi LI, 2018
• Check view definitions (cluster name is empty):
A2 - HADOOP CLUSTER SETUP
74
CHECK FILE VIEW
© Huxi LI, 2018
• HDFS File View after filling cluster name:
A2 - HADOOP CLUSTER SETUP
75
CHECK FILE VIEW
© Huxi LI, 2018
• Open Hive View failed:
• Missing ‘/user/huxili’
A2 - HADOOP CLUSTER SETUP
76
CHECK HIVE VIEW
© Huxi LI, 2018
• Create directory for custom Hive user:
• hdfs dfs -ls / && hdfs dfs -ls /user
A2 - HADOOP CLUSTER SETUP
77
CHECK HIVE VIEW
© Huxi LI, 2018
• Check HDFS:
• hdfs dfs -getfacl /user
• Create /user/huxili & /user/admin
• su hdfs
• hdfs dfs -mkdir /user/huxili && hdfs dfs -chown huxili /user/huxili
• hdfs dfs -mkdir /user/admin && hdfs dfs -chown admin /user/admin
A2 - HADOOP CLUSTER SETUP
78
CHECK HIVE VIEW
© Huxi LI, 2018
• Open Hive View OK after creation of missing directory:
A2 - HADOOP CLUSTER SETUP
79
CHECK HIVE VIEW
© Huxi LI, 2018
• Open Hive View 2.0 (OK) :
A2 - HADOOP CLUSTER SETUP
80
CHECK HIVE VIEW 2.0
© Huxi LI, 2018
• Check service binding:
• netstat -tulpn
A2 - HADOOP CLUSTER SETUP
81
CHECK SERVICE BINDING
© Huxi LI, 2018
• Service binding problems:
• ZooKeeper listens only on IPv6, and cluster join fails:
• Log error messages:
• (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server hdp-slave1.huxili.coms/... Will not
attempt to authenticate using SASL (unknown error)
• Solution:
• Force ZooKeeper to use IPv4 by adding the following line in “Advanced zookeeper-env => template”:
‐ export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djava.net.preferIPv4Stack=true"
A2 - HADOOP CLUSTER SETUP
82
CHECK SERVICE BINDING
© Huxi LI, 2018
• Force IPv4 for the following services:
• ambari-infra-solr:
• SOLR_OPTS="$SOLR_OPTS -Djava.net.preferIPv4Stack=true"
• SmartSense:
• export ANALYZER_JAVA_OPTS="{{analyzer_jvm_opts}} -Djava.net.preferIPv4Stack=true -Xmx{{analyzer_jvm_heap}}m"
• export ZEPPELIN_JAVA_OPTS="-Dhdp.version={{hdp_version}} -Dlog.file.name=activity-explorer.log -
DSmartSenseActivityExplorer -Djava.net.preferIPv4Stack=true"
• YARN:
• YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"
• STORM:
• DRPC: -Xmx768m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true
• Nimbus: -Xmx1024m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true
• Supervisor & UI & Storm-site/logviewer: -Djava.net.preferIPv4Stack=true
A2 - HADOOP CLUSTER SETUP
83
CHECK SERVICE BINDING
© Huxi LI, 2018
A2 - HADOOP CLUSTER SETUP
84
SERVICE INSTALL SUCCESSFUL !
© Huxi LI, 2018
ANNEXES
A1|Ambari Server
A2|Hadoop Cluster Setup
A3|Streaming Architecture Hive/Storm
• Store twitter data in HDFS
• Hive Query of Twitter data
• Twitter streaming with Storm
85
© Huxi LI, 2018
• Create the account huxili using Ambari
• Create directories in HDFS:
• Log in as hdfs (su hdfs)
• Create /user/huxili
• Change its owner to huxili
• Now log in to Ambari as user ‘huxili’ and open ‘Files View’ from the menu. Since huxili owns
the newly created directory, we can add data to it. I create the following directory to
save the Twitter data:
• /user/huxili/twitter_data
A3 - STREAMING ARCHITECTURE HIVE/STORM
86
STORE TWITTER DATA
© Huxi LI, 2018
• Open Hive view 2.0 and execute:
• CREATE DATABASE `huxili`
• Create Table ‘tweets’ :
SET hive.support.sql11.reserved.keywords=false; DROP TABLE IF EXISTS tweets;
CREATE EXTERNAL TABLE tweets (createddate string, geolocation string, tweetmessage string, `user`
struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE
LOCATION '/user/huxili/twitter_data/'
A3 - STREAMING ARCHITECTURE HIVE/STORM
87
HIVE QUERY OF TWITTER DATA – CREATE TABLE
© Huxi LI, 2018
• Create Table ‘tweets’ OK:
A3 - STREAMING ARCHITECTURE HIVE/STORM
88
HIVE QUERY OF TWITTER DATA – CREATE TABLE CONTINUED
© Huxi LI, 2018
• SET hive.support.sql11.reserved.keywords=false;
SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name =
'Aimee_Cottle';
A3 - STREAMING ARCHITECTURE HIVE/STORM
89
HIVE QUERY OF TWITTER DATA – QUERY DATA
Failed with error
messages
© Huxi LI, 2018
• Using another parser:
• curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-serde-1.3.8-jar-with-dependencies.jar -O
• curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-udf-1.3.8-jar-with-dependencies.jar -O
• cp *.jar /usr/hdp/2.6.4.0-91/hive/lib/ && cp *.jar /usr/hdp/2.6.4.0-91/hive2/lib/
• Restart Hive (Ambari)
• In Hive View2, execute:
• ADD JAR /usr/hdp/2.6.4.0-91/hive2/lib/json-serde-1.3.8-jar-with-dependencies.jar;
A3 - STREAMING ARCHITECTURE HIVE/STORM
90
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
© Huxi LI, 2018
• Create table ‘tweets2’ using the OpenX SerDe:
A3 - STREAMING ARCHITECTURE HIVE/STORM
91
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
CREATE EXTERNAL TABLE tweets2 (createddate string, geolocation string, tweetmessage string,
`user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/huxili/twitter_data/'
© Huxi LI, 2018
• Query twitter data on the new table:
• SELECT * FROM tweets2 LIMIT 100;
• It works !
A3 - STREAMING ARCHITECTURE HIVE/STORM
92
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
© Huxi LI, 2018
• Query twitter data on the new table:
• SET hive.support.sql11.reserved.keywords=false; SELECT tweetmessage, user.name, createddate FROM
tweets2 WHERE user.screenname = 'Aimee_Cottle';
A3 - STREAMING ARCHITECTURE HIVE/STORM
93
HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
OK
© Huxi LI, 2018
• Objective:
• In this example, I will use Storm and twitter4j to analyze real-time tweets and then save the
analysis results to HDFS on the Hadoop cluster.
• Components:
• TwitterSampleSpout – a Storm spout responsible for reading the Twitter stream using twitter4j
(http://twitter4j.org).
• WordSplitterBolt – a Storm bolt responsible for splitting tweets into words for analysis.
• IgnoreWordsBolt – a Storm bolt responsible for filtering out unwanted words.
• WordCounterBolt – a Storm bolt responsible for word-frequency analysis.
• HdfsWriterBolt – a Storm bolt responsible for writing the analysis results to HDFS.
A3 - STREAMING ARCHITECTURE HIVE/STORM
94
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Setup of Twitter streaming (OAuth):
A3 - STREAMING ARCHITECTURE HIVE/STORM
95
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• STORM Topology:
A3 - STREAMING ARCHITECTURE HIVE/STORM
96
STREAMING WITH STORM – TWITTER EXAMPLE
StreamingSpout → Tweets Splitter Bolt → Filter Bolt → Counter Bolt → Aggregation & HDFS Writing Bolt → Hadoop HDFS
© Huxi LI, 2018
• Java components:
• AppContext:
• Application configuration.
• HdfsWriterBolt:
• Aggregation & output to HDFS.
• TwitterSampleSpout:
• Streaming of tweets using twitter4j.
• WordSplitterBolt, IgnoreWordsBolt, WordCounterBolt:
• Splitting, filtering, and counting bolts.
• Resources:
• Configuration files for twitter4j and OAuth authentication.
A3 - STREAMING ARCHITECTURE HIVE/STORM
97
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• STORM Topology (Java):
A3 - STREAMING ARCHITECTURE HIVE/STORM
98
STREAMING WITH STORM – TWITTER EXAMPLE
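The original slide shows the topology class as a screenshot. The sketch below reconstructs the likely wiring from the components listed above, assuming Storm 1.x (org.apache.storm) as shipped with HDP 2.6; the component IDs, parallelism hints, and the "word" field name are illustrative, and the spout/bolt classes are assumed to live in the same package.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class Topology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // Spout reading the live Twitter stream via twitter4j
    builder.setSpout("tweets", new TwitterSampleSpout(), 1);
    // Split each tweet into individual words
    builder.setBolt("splitter", new WordSplitterBolt(), 2).shuffleGrouping("tweets");
    // Drop stop words and other noise
    builder.setBolt("filter", new IgnoreWordsBolt(), 2).shuffleGrouping("splitter");
    // Count word frequencies; fieldsGrouping routes each word to one bolt instance
    builder.setBolt("counter", new WordCounterBolt(), 1).fieldsGrouping("filter", new Fields("word"));
    // Aggregate and write the counts to HDFS
    builder.setBolt("hdfs-writer", new HdfsWriterBolt(), 1).shuffleGrouping("counter");

    StormSubmitter.submitTopology("twitter-stream-example", new Config(), builder.createTopology());
  }
}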
© Huxi LI, 2018
• Writing to HDFS (Java):
A3 - STREAMING ARCHITECTURE HIVE/STORM
99
STREAMING WITH STORM – TWITTER EXAMPLE
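This slide, too, is a code screenshot in the original. Below is a hedged sketch of what such an HDFS-writing bolt can look like using the plain Hadoop FileSystem API (rather than the storm-hdfs connector); the NameNode address and the tuple field names "word" and "count" are assumptions, while the output path comes from the topology diagram.

import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class HdfsWriterBolt extends BaseRichBolt {
  // HDFS handles are not serializable, so they are created in prepare(), not in the constructor
  private transient FileSystem fs;
  private transient FSDataOutputStream out;
  private transient OutputCollector collector;

  @Override
  public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    try {
      Configuration hadoopConf = new Configuration();
      hadoopConf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NameNode address
      fs = FileSystem.get(hadoopConf);
      // Output file from the topology diagram; overwritten on each run
      out = fs.create(new Path("/user/huxili/storm_out/counts.txt"), true);
    } catch (Exception e) {
      throw new RuntimeException("Cannot open HDFS output", e);
    }
  }

  @Override
  public void execute(Tuple tuple) {
    try {
      // Field names "word" and "count" are assumptions about WordCounterBolt's output
      String line = tuple.getStringByField("word") + "\t" + tuple.getLongByField("count") + "\n";
      out.write(line.getBytes(StandardCharsets.UTF_8));
      out.hflush(); // make the bytes visible to HDFS readers without closing the stream
      collector.ack(tuple);
    } catch (Exception e) {
      collector.fail(tuple);
    }
  }

  @Override
  public void cleanup() {
    try { out.close(); fs.close(); } catch (Exception ignored) { }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Terminal bolt: nothing is emitted downstream
  }
}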
© Huxi LI, 2018
• Packaging and execution:
• Built with Maven
• Using the shade plugin to produce a self-contained (fat) jar
A3 - STREAMING ARCHITECTURE HIVE/STORM
100
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Procedure of execution:
1. Create the shaded jar (mvn clean install -Pcluster)
2. Upload the generated application to HDFS using Ambari File View
3. Log in to slave1 (Storm is installed on slave1)
4. Perform the following operations:
‐ sudo su
‐ hdfs dfs -get /user/huxili/storm_apps/*.jar
‐ storm jar twitter-stream-example.jar com.huxili.storm.twitter.example.Topology
5. Observe the execution (see the logs)
6. When the execution terminates, check the output file saved in HDFS
(/user/huxili/storm_out/counts.txt)
A3 - STREAMING ARCHITECTURE HIVE/STORM
101
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Example execution logs:
A3 - STREAMING ARCHITECTURE HIVE/STORM
102
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Example execution logs:
A3 - STREAMING ARCHITECTURE HIVE/STORM
103
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
• Example results:
A3 - STREAMING ARCHITECTURE HIVE/STORM
104
STREAMING WITH STORM – TWITTER EXAMPLE
© Huxi LI, 2018
105
Progress. Together.

More Related Content

What's hot

Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks
 
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and HortonworksPowering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Hortonworks
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
Hortonworks
 
Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AI
DataWorks Summit
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended Benefits
DataWorks Summit
 
Containers and Big Data
Containers and Big DataContainers and Big Data
Containers and Big Data
DataWorks Summit
 
Containers and Big Data
Containers and Big Data Containers and Big Data
Containers and Big Data
DataWorks Summit
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash Course
DataWorks Summit
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
BigDataExpo
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
Hortonworks
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
DataWorks Summit
 
10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe
DataWorks Summit
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
DataWorks Summit
 
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
DataWorks Summit
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
Hortonworks
 
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Hortonworks
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Hortonworks
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks
 

What's hot (20)

Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and HortonworksPowering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
 
Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AI
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended Benefits
 
Containers and Big Data
Containers and Big DataContainers and Big Data
Containers and Big Data
 
Containers and Big Data
Containers and Big Data Containers and Big Data
Containers and Big Data
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash Course
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
 
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks Data In Motion Webinar Series Pt. 2
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 

Similar to Hortonworks HDP, Is it goog enough ?

Transform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksTransform You Business with Big Data and Hortonworks
Transform You Business with Big Data and Hortonworks
Hortonworks
 
Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks
Pactera_US
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User Group
Mats Johansson
 
Meetup oslo hortonworks HDP
Meetup oslo hortonworks HDPMeetup oslo hortonworks HDP
Meetup oslo hortonworks HDP
Alexander Bakos Leirvåg
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
Humoyun Ahmedov
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Hortonworks
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
Eric Baldeschwieler
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
Muhammad Rifqi
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
Hortonworks
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
Alluxio, Inc.
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Hortonworks
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
John Sing
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Vigen Sahakyan
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
Hortonworks
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
Skillspeed
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
Pratimakumari213460
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
POSSCON
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
zafarali1981
 

Similar to Hortonworks HDP, Is it goog enough ? (20)

Transform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksTransform You Business with Big Data and Hortonworks
Transform You Business with Big Data and Hortonworks
 
Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User Group
 
Meetup oslo hortonworks HDP
Meetup oslo hortonworks HDPMeetup oslo hortonworks HDP
Meetup oslo hortonworks HDP
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
 

Recently uploaded

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 

Recently uploaded (20)

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Hortonworks HDP, Is it goog enough ?

  • 1. © Huxi LI, 2018 HADOOP BIG DATA PLATFORM Hortonworks HDP, Is it good enough ?
  • 2. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good enough ? 3|Hadoop cluster provision with Ambari 4|Demos of typical use cases Annexes 2
  • 3. © Huxi LI, 2018 Digitalization INDUSTRIAL TRENDS • Digitalization of core business: Going digital is now a core strategy for many organizations around the world, including financial et insurance core business. Most successful enterprises deliver their core business through digital channels. • Digitalization of society: Social networks, collaborative platforms, e- government, and smart devices are changing the entire society. - Social network is everywhere and start impacting the business model; - Industrialized countries are pushing the digitalization of their entire public services as a way of modernizing governments and better serving their citizens. • Internet Of Thing (IoT): The internet are expanding to physical objects; experts estimate that the IoT will consist of about 30 billion objects by 2020 (IEEE, 2016). • Artificial Intelligence (AI): The maturity of AI reaches another level, from academic experimental stage into massive practical usage era. AI not only need massive data and but also produce massive data. 3 A matter of Data
  • 4. © Huxi LI, 2018 AGE OF BIG DATA 4 Big Volume High Variety Uncertain Veracity High Velocity In 2016, global mobile data traffic amounted to 7 exabytes per month. In 2021, mobile data traffic worldwide is expected to reach 49 exabytes per month at a compound annual growth rate of 47 percent (Statista, 2018) Variety of data can be numerous - structured, semi-structured and mostly unstructured data as well. Velocity of data producing can be variable from slow batch processing, near real-time streaming, to real-time IoT sensors. Veracity is one of the unfortunate characteristics of today’s data. Not all data are trustworthy.
  • 5. © Huxi LI, 2018 HADOOP = DESIGNED FOR BIG DATA Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up to thousands of machines, each offering local computation and storage. • Massive Storage (Volume): Hadoop’s distributed file system (HDFS), is specially designed to store massive data on commodity hardware, providing high-throughput access to application data. • Any kind of Data (Variety): You can save text, image, video, XML, JSON and any other type of data in HDFS, as if you are working on a unix-like file system - this is very important because of the variety of today’s data. • Massively Scalable Computation: Parallel computing is integrated into Hadoop core system – YARN for cluster resource management and MapReduce for parallel processing of large data sets. And many other derived tools such as Storm, Hive, and so on you can choose from. 5
  • 6. © Huxi LI, 2018 HADOOP = A MATURE SYSTEM • Production-proved: Hadoop has been confirmed in production by many companies around the world. • Available on Cloud: Hadoop are now deployed on all major cloud platforms such Microsoft AZURE, Amazon AWS, IBM Cloud, and Google GCP. 6
  • 7. © Huxi LI, 2018 HADOOP = A RICH ECOSYSTEM The term Hadoop refers to the entire ecosystem or collection of software packages that can be installed on top of or alongside Hadoop. Many choices : • Ingestion : Scoop, Flume, Kafka, NIFI, Storm, Flink, … • Streaming & Compute: MapReduce, Storm, Spark, Flink, Nifi, … • Analysis/SQL-on-Hadoop: Impala, Hive, Drill, … • Machine learning: Mahout, Spark ML, … • NoSQL: HBase, Cassandra, … • Management: Ambari, HUE, … 7
  • 8. © Huxi LI, 2018 VALUE PROPOSITIONS OF HADOOP 8 Designed for Big Data • Massively scalable storage and computation • Carefully designed for large scale deployment High Maturity • Production-proved, highly robust • Confirmed by big clusters with thousands of nodes Rich Ecosystem • Ecosystem formed around Hadoop with many choices, with enterprise-grade quality but freely available • Wide supports, commercially or of community. High Scalability • Technically, a single Hadoop cluster can be scaled to thousands of nodes • Functionally, Hadoop support any kind of data, scalable to a wide range of use cases.
  • 9. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good enough ? 3|Hadoop cluster provision with Ambari 4|Demos of typical use cases Annexes 9
  • 10. © Huxi LI, 2018 “Enterprises like Hortonworks’ storage and compute processing, broad data ingestion, data governance, and open source support when deploying BDW.” - Forrester 2017 HDP = ENTERPRISE-GRADE HADOOP + 100% OPEN SOURCE Enterprise-grade Hadoop Hortonworks HDP is the only Enterprise-Grade Hadoop distribution of fully Open Source Apache License All of the technology built into the Hortonworks Data Platform is an Apache open source project 100% Open Source Hortonworks HDP is the only Enterprise-Grade Hadoop distribution of fully Open Source No Vendor Lock-In HDP delivers enterprise-grade software that fosters innovation and prevents vendor lock-in HDP
  • 11. © Huxi LI, 2018 HDP = CAREFULLY ENGINEERED SECURITY ARCHITECTURE • Platform built-in security and data governance. Built-In Security • Support integration into enterprise’s existing security system with support of industrial standards, such as Kerberos and LDAP. Corporate Integration • Centralized cluster management, monitoring, and access control. Centralized control • Simple yet powerful, Unix-like, resource permission control (HDFS) Simplicity
  • 12. © Huxi LI, 2018 HDP = LEADER OF BIG DATA WAREHOUSE • Hortonworks was ranked as a leader in The Forrester Wave™: Big Data Warehouse (Q2 2017). 12 “Hortonworks delivers actionable intelligence from all kinds of data-in-motion and data-at-rest. Through its open source strategy, Hortonworks continually evolves its offering by working closely with partners across the EDW ecosystem of tools and vendors. The vendor provides a cost-effective, nimble, and scalable architecture to implement big data warehouses, whether on-premises or in the cloud.” - Forrester 2017
  • 13. © Huxi LI, 2018 HORTONWORKS HDP 13
  • 14. © Huxi LI, 2018 VALUE PROPOSITIONS OF HORTONWORKS HDP Cost • No license fee • Pay for support Openness • Open source • With commercial friendly Apache license Expertise • Key contributors of Hadoop • Decade of industrial Big Data experience Security • Guarantee of a market leader • Mature technology, Support on premise, cloud, or appliance.
  • 15. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good for me ? 3|Hadoop Cluster Provision with Ambari 4|Demos of typical use cases Annexes 15
  • 16. © Huxi LI, 2018 AMBARI HADOOP CLUSTER PROVISION 1. Prepare VMs (manually, or with Ansible, Puppet, …) 2. Install Ambari server 3. Create Hadoop cluster with Ambari 4. Explore Hadoop cluster with Ambari
  • 17. © Huxi LI, 2018 TYPICAL HADOOP CLUSTERS • Hadoop backbone: The Hadoop backbone is typically composed of HDFS, MapReduce and YARN, i.e. the core of the Hadoop cluster: • HDFS for massive storage service • MapReduce for parallel computing over datasets • YARN for job scheduling and resource management • Edge nodes: Edge nodes are part of the Hadoop cluster, but they are typically connected to the rest of the corporate network, providing operational environments to end users. • Ambari is a typical edge node providing cluster management services • Sqoop and Flume are other kinds of edge nodes moving dataflows in and out of the cluster • Spark and Hive are edge nodes serving applications or business users 17 (Diagram: the corporate network reaches the cluster through edge nodes, e.g. Hive for applications and Ambari for administration and management)
  • 18. © Huxi LI, 2018 A PROOF-OF-CONCEPT HADOOP CLUSTER • POC Hadoop cluster: • Composed of 3 nodes • Shared VMs for edge and backbone roles • Hortonworks HDP 2.6.1.3 • AWS VPC 172.31.0.0/16 • Provisioner: • Ambari as provisioner of the cluster 18 VPC: 172.31.0.0/16 hdp-master (172.31.45.112) hdp-slave1 (172.31.37.152) hdp-slave2 (172.31.45.181) Namenode, Ambari, etc. Datanode, Hive, Storm, etc. Datanode, Nodemanager, etc.
  • 19. © Huxi LI, 2018 MACHINE SETUP • Provision VMs with AWS: • 3 EC2 instances • Red Hat Linux 7.0 • 1 VPC allowing traffic among nodes 19
  • 20. © Huxi LI, 2018 NETWORK TRAFFIC SETUP • Authorized traffic (AWS): 20 Cluster internal traffic
  • 21. © Huxi LI, 2018 INSTALL AMBARI SERVER • Follow the official installation guide for details 21
  • 22. © Huxi LI, 2018 SETUP HADOOP CLUSTER WITH AMBARI 22 Click Here Ambari Web UI: http://hdp-master.huxili.com:8080
  • 23. © Huxi LI, 2018 SETUP HADOOP CLUSTER WITH AMBARI 23 Choose Hadoop nodes
  • 24. © Huxi LI, 2018 EXPLORE HADOOP CLUSTER WITH AMBARI 24
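Beyond the web UI, Ambari also exposes a REST API on the same port, which is handy for scripted sanity checks of the cluster just created. Below is a minimal Java sketch, not from the original deck: the cluster name hdp_poc is a placeholder for whatever name is chosen in the creation wizard, and admin/admin are the default credentials used above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClusterCheck {
    public static void main(String[] args) throws Exception {
        // List the services installed on the cluster via Ambari's REST API.
        // "hdp_poc" is a placeholder cluster name.
        URL url = new URL("http://hdp-master.huxili.com:8080/api/v1/clusters/hdp_poc/services");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        con.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON list of installed services
            }
        }
    }
}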
  • 25. © Huxi LI, 2018 AGENDA 1|Why Hadoop ? 2|Hortonworks HDP, is it good for me ? 3|Hadoop Cluster Provision with Ambari 4|Demos of typical use cases Annexes 25
  • 26. © Huxi LI, 2018 1. Data storage in HDFS … 26
  • 27. © Huxi LI, 2018 DEMO 1 - SAVING DATA IN HDFS • Open the Ambari UI at http://hdp-master.huxili.com • Open File View from the Ambari UI • Create a directory in HDFS: /user/huxili/dfs_demo • Upload files into the newly created directory • Manipulate the uploaded files (rename, set permissions; Ctrl+Mouse => select directory); the same operations can also be scripted, as sketched below
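For completeness, here is how the same operations look against the HDFS Java API instead of the File View. This is an illustrative sketch only: the NameNode URI (default RPC port 8020) and the local file name tweets.json are assumptions, not values from the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class DfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address, matching the POC master node in this deck.
        conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/huxili/dfs_demo");
        fs.mkdirs(dir);                                    // create the demo directory
        fs.copyFromLocalFile(new Path("tweets.json"),      // upload a local file (name assumed)
                             new Path(dir, "tweets.json"));
        fs.setPermission(dir, new FsPermission((short) 0755)); // set permissions
        fs.close();
    }
}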
  • 28. © Huxi LI, 2018 2. JSON Data query with Hive … 28
  • 29. © Huxi LI, 2018 DEMO 2 – DATA QUERY WITH HIVE • Check the sample Twitter data (JSON) in /user/huxili/twitter_data • Open Ambari Hive View 2.0 • Create a database for demo purposes -> create database demo; • Create an external table demo_tweets pointing to /user/huxili/twitter_data • Query the JSON data in Hive (a JDBC sketch of the same steps follows below)
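The same steps can also be driven from code through the HiveServer2 JDBC driver. A minimal sketch follows, under a few assumptions: Hive runs on hdp-slave1 with the default HiveServer2 port 10000, the hive-jdbc driver is on the classpath, and the OpenX SerDe jar is installed server-side (annex A3 below shows the stock JsonSerDe failing on this dataset, which is why OpenX is used here).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJsonQuery {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint based on this deck's node layout.
        String url = "jdbc:hive2://hdp-slave1.huxili.com:10000/default";
        try (Connection con = DriverManager.getConnection(url, "huxili", "");
             Statement st = con.createStatement()) {
            st.execute("CREATE DATABASE IF NOT EXISTS demo");
            // OpenX SerDe: requires json-serde-*.jar in Hive's lib (see annex A3).
            st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS demo.demo_tweets ("
                + "createddate string, geolocation string, tweetmessage string) "
                + "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' "
                + "LOCATION '/user/huxili/twitter_data/'");
            try (ResultSet rs = st.executeQuery(
                    "SELECT tweetmessage FROM demo.demo_tweets LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}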
  • 30. © Huxi LI, 2018 3. Data Streaming with Storm … 30
  • 31. © Huxi LI, 2018 DEMO 3 – STREAMING OF TWEETS WITH STORM 31 StreamingSpout -> Tweets Splitter Bolt -> Filter Bolt -> Counter Bolt -> Aggregation & HDFS writing Bolt -> Hadoop HDFS (File: /user/huxili/storm_out/counts.txt)
  • 32. © Huxi LI, 2018 DEMO 3 – DATA STREAMING WITH STORM • Open Ambari File View -> upload the example streaming jar • SSH into hdp-slave1 (the Storm host) • su to root and download the streaming example (hdfs dfs -get /user/huxili/storm_apps/tw*.jar) • Submit the streaming app to Storm • Streaming stops after 30 seconds; explore the results in Ambari
  • 33. © Huxi LI, 2018 Annexes 33
  • 34. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup • Provision VMs on AWS • Install required packages • Install Ambari Server A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm 34
  • 35. © Huxi LI, 2018 • Provision VMs: • Using AWS EC2 instances • Red Hat Linux 7.0 • AWS / VPC, 3 nodes A1 - AMBARI SERVER SETUP 35 MACHINES SETUP
  • 36. © Huxi LI, 2018 • Updating /etc/hosts (all nodes): • 172.31.45.112 hdp-master.huxili.com • 172.31.37.152 hdp-slave1.huxili.com • 172.31.45.181 hdp-slave2.huxili.com • Updating /etc/hostname (each node): • hdp-master.huxili.com (master) • hdp-slave1.huxili.com (slave1) • hdp-slave2.huxili.com (slave2) A1 - AMBARI SERVER SETUP 36 MACHINES SETUP
  • 37. © Huxi LI, 2018 • Authorized traffic (AWS): A1 - AMBARI SERVER SETUP 37 NETWORK TRAFFIC SETUP Cluster internal traffic
  • 38. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup • Provision VMs on AWS • Install required packages • Install Ambari Server A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm 38
  • 39. © Huxi LI, 2018 • hdp-master.huxili.com: • sudo yum install -y ntp • sudo systemctl enable ntpd • sudo yum install -y postgresql-jdbc • sudo yum install -y wget • sudo wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.1.3/ambari.repo -O /etc/yum.repos.d/ambari.repo • sudo yum install -y ambari-server A1 - AMBARI SERVER SETUP 39 INSTALL REQUIRED PACKAGES
  • 40. © Huxi LI, 2018 • hdp-slave1.huxili.com & hdp-slave2.huxili.com: • sudo yum install -y ntp • sudo systemctl enable ntpd A1 - AMBARI SERVER SETUP 40 INSTALL REQUIRED PACKAGES
  • 41. © Huxi LI, 2018 A1 - AMBARI SERVER SETUP 41 INSTALL REQUIRED PACKAGES • hdp-slave1.huxili.com: Install hive metastore DB (postgres)
  • 42. © Huxi LI, 2018 A1 - AMBARI SERVER SETUP 42 INSTALL REQUIRED PACKAGES • hdp-slave1.huxili.com: Check Hive DB
  • 43. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup • Provision VMs on AWS • Install required packages • Install Ambari Server A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm 43
  • 44. © Huxi LI, 2018 • Commands to run: A1 - AMBARI SERVER SETUP 44 INSTALL SERVER
  • 45. © Huxi LI, 2018 • Configure JDBC driver: • sudo ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar A1 - AMBARI SERVER SETUP 45 INSTALL SERVER
  • 46. © Huxi LI, 2018 • Setup server: • sudo ambari-server setup A1 - AMBARI SERVER SETUP 46 INSTALL SERVER
  • 47. © Huxi LI, 2018 • Setup server (continued): A1 - AMBARI SERVER SETUP 47 INSTALL SERVER
  • 48. © Huxi LI, 2018 • Start server: A1 - AMBARI SERVER SETUP 48 INSTALL SERVER
  • 49. © Huxi LI, 2018 • Check services binding: A1 - AMBARI SERVER SETUP 49 INSTALL SERVER
  • 50. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup A2|Hadoop Cluster Setup • Install Hadoop components • Check cluster sanity A3|Streaming Architecture Hive/Storm 50
  • 51. © Huxi LI, 2018 • Open Ambari management UI: • http://hdp-master.huxili.com:8080 • User: admin • Pwd: admin A2 - HADOOP CLUSTER SETUP 51 STEP 1 – OPEN MANAGEMENT UI
  • 52. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 52 STEP 2 – START CREATION WIZARD Click Here
  • 53. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 53 STEP 3 – CHOOSE CLUSTER NAME
  • 54. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 54 STEP 4 – CHOOSE VERSION
  • 55. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 55 STEP 5 – CHOOSE PARAMETERS
  • 56. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 56 STEP 6 – AMBARI HOST CHECK (FAILED) Error
  • 57. © Huxi LI, 2018 • vi /etc/hostname on each node and set the right hostname: • hdp-master.huxili.com • hdp-slave1.huxili.com • hdp-slave2.huxili.com A2 - HADOOP CLUSTER SETUP 57 STEP 6 – UPDATE HOSTNAMES AND RETRY
  • 58. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 58 STEP 7 – CHOOSE REQUIRED SERVICES
  • 59. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 59 STEP 8 – DISTRIBUTION OF COMPONENTS
  • 60. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 60 STEP 8 – DISTRIBUTION OF COMPONENTS
  • 61. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 61 STEP 9 – CONFIGURE HIVE DB (POSTGRES)
  • 62. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 62 STEP 10 – CONTINUED
  • 63. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 63 STEP 11 – CONTINUED
  • 64. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 64 STEP 12 – CONTINUED
  • 65. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 65 STEP 13 – CONTINUED
  • 66. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 66 STEP 14 – INSTALLATION FAILED
  • 67. © Huxi LI, 2018 • Error messages : • Solution : manually install the missing package • sudo yum-config-manager --enable rhui-REGION-rhel-server-optional • sudo yum install libtirpc-devel A2 - HADOOP CLUSTER SETUP 67 STEP 15 – ANALYZE ERROR MESSAGE Execution of '/usr/bin/yum -d 0 -e 0 -y install hadoop_2_6_4_0_91-client' returned 1. Error: Package: hadoop_2_6_4_0_91-hdfs-2.7.3.2.6.4.0-91.x86_64 (HDP-2.6-repo-1) Requires: libtirpc-devel You could try using --skip-broken to work around the problem You could try running: rpm -Va --nofiles --nodigest
  • 68. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 68 STEP 16 – INSTALL MISSING PACKAGES
  • 69. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 69 STEP 17 – INSTALL SUCCESS
  • 70. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 70 STEP 18 – AMBARI DASHBOARD
  • 71. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 71 STEP 19 – CREATE USER ‘HUXILI’
  • 72. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup A2|Hadoop Cluster Setup • Install Hadoop components • Check cluster sanity A3|Streaming Architecture Hive/Storm 72
  • 73. © Huxi LI, 2018 • Open HDFS File View failed: A2 - HADOOP CLUSTER SETUP 73 CHECK FILE VIEW
  • 74. © Huxi LI, 2018 • Check view definitions (cluster name is empty): A2 - HADOOP CLUSTER SETUP 74 CHECK FILE VIEW
  • 75. © Huxi LI, 2018 • HDFS File View after filling cluster name: A2 - HADOOP CLUSTER SETUP 75 CHECK FILE VIEW
  • 76. © Huxi LI, 2018 • Open Hive View failed: • Missing ‘/user/huxili’ A2 - HADOOP CLUSTER SETUP 76 CHECK HIVE VIEW
  • 77. © Huxi LI, 2018 • Create directory for custom Hive user: • hdfs dfs -ls / && hdfs dfs -ls /user A2 - HADOOP CLUSTER SETUP 77 CHECK HIVE VIEW
  • 78. © Huxi LI, 2018 • Check HDFS: • hdfs dfs -getfacl /user • Create /user/huxili and /user/admin • su hdfs • hdfs dfs -mkdir /user/huxili && hdfs dfs -chown huxili /user/huxili • hdfs dfs -mkdir /user/admin && hdfs dfs -chown admin /user/admin A2 - HADOOP CLUSTER SETUP 78 CHECK HIVE VIEW
  • 79. © Huxi LI, 2018 • Open Hive View OK after creation of missing directory: A2 - HADOOP CLUSTER SETUP 79 CHECK HIVE VIEW
  • 80. © Huxi LI, 2018 • Open Hive View 2.0 (OK) : A2 - HADOOP CLUSTER SETUP 80 CHECK HIVE VIEW 2.0
  • 81. © Huxi LI, 2018 • Check service binding: • netstat -tulpn A2 - HADOOP CLUSTER SETUP 81 CHECK SERVICE BINDING
  • 82. © Huxi LI, 2018 • Service binding problems: • ZooKeeper listens only on IPv6 and the cluster join fails : • Log error messages: • (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server hdp-slave1.huxili.coms/... Will not attempt to authenticate using SASL (unknown error) • Solution: • Force ZooKeeper to use IPv4 by adding the following line in “Advanced zookeeper-env => template”: ‐ export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djava.net.preferIPv4Stack=true" A2 - HADOOP CLUSTER SETUP 82 CHECK SERVICE BINDING
  • 83. © Huxi LI, 2018 • Force IPv4 for the following services: • ambari-infra-solr : • SOLR_OPTS="$SOLR_OPTS -Djava.net.preferIPv4Stack=true" • SmartSense: • export ANALYZER_JAVA_OPTS="{{analyzer_jvm_opts}} -Djava.net.preferIPv4Stack=true -Xmx{{analyzer_jvm_heap}}m" • export ZEPPELIN_JAVA_OPTS="-Dhdp.version={{hdp_version}} -Dlog.file.name=activity-explorer.log -DSmartSenseActivityExplorer -Djava.net.preferIPv4Stack=true" • YARN: • YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true" • STORM: • DRPC: -Xmx768m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true • Nimbus: -Xmx1024m _JAAS_PLACEHOLDER -Djava.net.preferIPv4Stack=true • Supervisor & UI & Storm-site/logviewer: -Djava.net.preferIPv4Stack=true A2 - HADOOP CLUSTER SETUP 83 CHECK SERVICE BINDING
  • 84. © Huxi LI, 2018 A2 - HADOOP CLUSTER SETUP 84 SERVICE INSTALL SUCCESSFUL !
  • 85. © Huxi LI, 2018 ANNEXES A1|Ambari Server Setup A2|Hadoop Cluster Setup A3|Streaming Architecture Hive/Storm • Store twitter data in HDFS • Hive Query of Twitter data • Twitter streaming with Storm 85
  • 86. © Huxi LI, 2018 • Create account huxili using Ambari • Create directories in HDFS : • Log in as hdfs (su hdfs) • Create /user/huxili • Change its owner to huxili • Now log in to Ambari as user ‘huxili’ and open ‘Files View’ from the menu. Since huxili is the owner of the newly created directory, we can add data into this directory. I create the following directory to save Twitter data: • /user/huxili/twitter_data A3 - STREAMING ARCHITECTURE HIVE/STORM 86 STORE TWITTER DATA
  • 87. © Huxi LI, 2018 • Open Hive View 2.0 and execute: • CREATE DATABASE `huxili`; • Create table 'tweets' : SET hive.support.sql11.reserved.keywords=false; DROP TABLE IF EXISTS tweets; CREATE EXTERNAL TABLE tweets (createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/user/huxili/twitter_data/'; A3 - STREAMING ARCHITECTURE HIVE/STORM 87 HIVE QUERY OF TWITTER DATA – CREATE TABLE
  • 88. © Huxi LI, 2018 • Create Table ‘tweets’ OK: A3 - STREAMING ARCHITECTURE HIVE/STORM 88 HIVE QUERY OF TWITTER DATA – CREATE TABLE CONTINUED
  • 89. © Huxi LI, 2018 • SET hive.support.sql11.reserved.keywords=false; SELECT DISTINCT tweetmessage, user.name, createddate FROM tweets WHERE user.name = 'Aimee_Cottle'; A3 - STREAMING ARCHITECTURE HIVE/STORM 89 HIVE QUERY OF TWITTER DATA – QUERY DATA Failed with error messages
  • 90. © Huxi LI, 2018 • Using another parser: • curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-serde-1.3.8-jar-with-dependencies.jar -O • curl http://www.congiu.net/hive-json-serde/1.3.8/cdh5/json-udf-1.3.8-jar-with-dependencies.jar -O • cp *.jar /usr/hdp/2.6.4.0-91/hive/lib/ && cp *.jar /usr/hdp/2.6.4.0-91/hive2/lib/ • Restart Hive (Ambari) • In Hive View 2.0, execute: • ADD JAR /usr/hdp/2.6.4.0-91/hive2/lib/json-serde-1.3.8-jar-with-dependencies.jar; A3 - STREAMING ARCHITECTURE HIVE/STORM 90 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
  • 91. © Huxi LI, 2018 • Create table 'tweets2' using OpenX: A3 - STREAMING ARCHITECTURE HIVE/STORM 91 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER CREATE EXTERNAL TABLE tweets2 (createddate string, geolocation string, tweetmessage string, `user` struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION '/user/huxili/twitter_data/';
  • 92. © Huxi LI, 2018 • Query twitter data on the new table: • SELECT * FROM tweets2 LIMIT 100; • It works ! A3 - STREAMING ARCHITECTURE HIVE/STORM 92 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER
  • 93. © Huxi LI, 2018 • Query twitter data on the new table: • SET hive.support.sql11.reserved.keywords=false; SELECT tweetmessage, user.name, createddate FROM tweets2 WHERE user.screenname = 'Aimee_Cottle'; A3 - STREAMING ARCHITECTURE HIVE/STORM 93 HIVE QUERY OF TWITTER DATA – TRY ANOTHER PARSER OK
  • 94. © Huxi LI, 2018 • Objective: • In this example, I will use Storm and twitter4j to analyze real-time tweets, then save the analysis results to HDFS on the Hadoop cluster. • Components: • TwitterSampleSpout – a Storm spout responsible for reading the Twitter stream using twitter4j (http://twitter4j.org). • WordSplitterBolt – a Storm bolt responsible for splitting tweets into words for analysis. • IgnoreWordsBolt – a Storm bolt responsible for filtering out unwanted words. • WordCounterBolt – a Storm bolt responsible for word-frequency analysis. • HdfsWriterBolt – a Storm bolt responsible for writing the analysis results to HDFS. A3 - STREAMING ARCHITECTURE HIVE/STORM 94 STREAMING WITH STORM – TWITTER EXAMPLE
  • 95. © Huxi LI, 2018 • Setup of Twitter streaming (OAuth): A3 - STREAMING ARCHITECTURE HIVE/STORM 95 STREAMING WITH STORM – TWITTER EXAMPLE
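The OAuth setup is shown here only as a screenshot. As a rough sketch, twitter4j credentials can alternatively be supplied in code through its ConfigurationBuilder; the four key values below are placeholders to be replaced with the tokens generated on the Twitter developer portal.

import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterAuthDemo {
    public static TwitterStream buildStream() {
        // Placeholder credentials; real values come from the Twitter developer portal.
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.setOAuthConsumerKey("<consumer-key>")
          .setOAuthConsumerSecret("<consumer-secret>")
          .setOAuthAccessToken("<access-token>")
          .setOAuthAccessTokenSecret("<access-token-secret>");
        return new TwitterStreamFactory(cb.build()).getInstance();
    }
}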
  • 96. © Huxi LI, 2018 • STORM Topology: A3 - STREAMING ARCHITECTURE HIVE/STORM 96 STREAMING WITH STORM – TWITTER EXAMPLE StreamingSpout -> Tweets Splitter Bolt -> Filter Bolt -> Counter Bolt -> Aggregation & HDFS writing Bolt -> Hadoop HDFS
  • 97. © Huxi LI, 2018 • Java Components: • AppContext: • Configuration of application. • HdfsWriterBolt: • Aggregation & Output to HDFS. • TwitterSampleSpout: • Streaming of tweets using twitter4j. • WordSplitterBolt, IgnoreWordsBolt, WordCounterBolt : • Splitting, filtering, and counting bolts. • Resources: • Configuration files for twitter4j and OAUTH authentication. A3 - STREAMING ARCHITECTURE HIVE/STORM 97 STREAMING WITH STORM – TWITTER EXAMPLE
  • 98. © Huxi LI, 2018 • STORM Topology (Java): A3 - STREAMING ARCHITECTURE HIVE/STORM 98 STREAMING WITH STORM – TWITTER EXAMPLE
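The topology code itself appears only as a screenshot, so the following is a hedged reconstruction of how the five components listed above could be wired together with Storm's TopologyBuilder. Only the class names come from the deck; the stream ids, the field name "word", and the groupings are assumptions.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class Topology {
    public static void main(String[] args) throws Exception {
        // Spout and bolt classes are the project classes listed above
        // (same package assumed).
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TwitterSampleSpout());
        builder.setBolt("split", new WordSplitterBolt()).shuffleGrouping("tweets");
        builder.setBolt("filter", new IgnoreWordsBolt()).shuffleGrouping("split");
        builder.setBolt("count", new WordCounterBolt())
               .fieldsGrouping("filter", new Fields("word")); // field name assumed
        builder.setBolt("hdfs", new HdfsWriterBolt()).globalGrouping("count");

        Config conf = new Config();
        StormSubmitter.submitTopology("twitter-stream-example", conf,
                                      builder.createTopology());
    }
}

Once packaged, this class is what the storm jar command in the execution procedure below submits to the cluster.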
  • 99. © Huxi LI, 2018 • Writing to HDFS (Java): A3 - STREAMING ARCHITECTURE HIVE/STORM 99 STREAMING WITH STORM – TWITTER EXAMPLE
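The HDFS writer is likewise shown only as a screenshot. The sketch below illustrates one plausible shape for HdfsWriterBolt: accumulate word counts in memory and flush them to /user/huxili/storm_out/counts.txt when the topology shuts down. The NameNode URI and field names are assumptions, and error handling is kept minimal.

import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class HdfsWriterBolt extends BaseRichBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) { }

    @Override
    public void execute(Tuple tuple) {
        // Keep the latest count seen for each word (field names assumed).
        counts.put(tuple.getStringByField("word"),
                   tuple.getIntegerByField("count"));
    }

    @Override
    public void cleanup() {
        // Dump the aggregated counts to HDFS on topology shutdown.
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hdp-master.huxili.com:8020"); // assumed NN
            FileSystem fs = FileSystem.get(conf);
            try (PrintWriter out = new PrintWriter(
                    fs.create(new Path("/user/huxili/storm_out/counts.txt"), true))) {
                counts.forEach((word, n) -> out.println(word + "\t" + n));
            }
            fs.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { } // terminal bolt
}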
  • 100. © Huxi LI, 2018 • Packaging and execution: • Built with Maven • Using the Shade plugin to produce a self-contained jar A3 - STREAMING ARCHITECTURE HIVE/STORM 100 STREAMING WITH STORM – TWITTER EXAMPLE
  • 101. © Huxi LI, 2018 • Procedure of execution: 1. Create the shaded jar (mvn clean install -Pcluster) 2. Upload the generated application to HDFS using the Ambari File View 3. Log in to hdp-slave1 (Storm is installed on slave1) 4. Perform the following operations: ‐ sudo su ‐ hdfs dfs -get /user/huxili/storm_apps/*.jar ‐ storm jar twitter-stream-example.jar com.huxili.storm.twitter.example.Topology 5. Observe the execution (see logs) 6. When the execution terminates, check the output file saved in HDFS (/user/huxili/storm_out/counts.txt) A3 - STREAMING ARCHITECTURE HIVE/STORM 101 STREAMING WITH STORM – TWITTER EXAMPLE
  • 102. © Huxi LI, 2018 • Example execution logs: A3 - STREAMING ARCHITECTURE HIVE/STORM 102 STREAMING WITH STORM – TWITTER EXAMPLE
  • 103. © Huxi LI, 2018 • Example execution logs: A3 - STREAMING ARCHITECTURE HIVE/STORM 103 STREAMING WITH STORM – TWITTER EXAMPLE
  • 104. © Huxi LI, 2018 • Example results: A3 - STREAMING ARCHITECTURE HIVE/STORM 104 STREAMING WITH STORM – TWITTER EXAMPLE
  • 105. © Huxi LI, 2018 Progress. Together. 105