Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

© 2014 VMware Inc. All rights reserved.
Virtualized Big Data Platform
@ VMware Corp IT
Rajit Saha
Hadoop Development Lead
VMware Corp IT Data Solution and Delivery
An Enterprise Data Warehouse meets an Elephant

2
Business Use Case for Big Data Analytics
@ VMware BI Space
Personalized Marketing & Customer Targeting
Personalized Campaign Content Strategy
MyVMware Log Analytics
- Combine user-level data (logins and other activities) with Clickstream Data and Product Data
VMware Product List Price Optimization and Deal Analytics for VMware Pricing Team
- Complex ETL, Bigger Joins
- Flattening Star Schema Tables
- Propensity Modeling
GSS Service Request Logs Analytics
- Deeper Learning of VMware Product Issues
- Build a highly intelligent recommendation system to fix Customer Issues with faster turnaround time
- High Volume ~ 400TB
- A lot of Variety of data
- Complex parsing
Clickstream Data Analytics
• Path analysis – from first user visit to product purchase
• Propensity Modeling
• Predictive Analytics – which product a user will buy
• Customer Lifetime Value Analysis
• 554 columns, 1.5B Rows, 20TB Data (2 yrs)
Diagram labels: EDW; BIG DATA (Variety, Volume, Velocity)

3
Components of Big Data Cluster
• This Big Data Cluster is fully virtualized, based on vSphere 6.0 and VMware Big Data Extensions 2.2
• We used EMC Isilon 7.2.0.2 with two patches for HDFS Storage
• We used Pivotal Big Data Suite 3.0 for Hadoop 2.6 and HAWQ 1.3
• We used Pivotal Spring XD 1.2 for Data Ingestion to Hadoop
• We integrated this with Alpine Data Lab 5.4 for running
  - Deeper Analytic Functions
  - Machine Learning Algorithms
• We integrated HUE 2.6 as a GUI-based Hive/Pig query execution client

4
VMware Corp IT Big Data Analytic Platform [Production] – Application Architecture Stack
[Architecture diagram]
Legend:
HS = HAWQ Segment (HS 1 to HS 4)
PXF = Pivotal Extension Framework
ZK = ZooKeeper Server
SC = Spring XD Container
NM = YARN Node Manager
Platform notes:
• VMware Big Data Extensions 2.2 provisions the Hadoop VMs
• Application Stack: Pivotal HD (PHD 3.0), Spring XD 1.2, RabbitMQ 3.5.3, Postgres 9.4
• Analytics Tool: Alpine Data Lab 5.4
• HDFS Storage: EMC Isilon 7.2.0.2 + Restricted Patch-14925
• HDFS Capacity: 30T
• Temp Storage: VMDKs on VNX NAS (NAS Shared Storage, [HadoopTempSpace])
• 4 HAWQ Segments & 4 Mapred Local Directories will be mounted on 4 VMDKs on NFS in the Worker VMs
Storage: Isilon HDFS nodes, 10G Data Link
Virtual machines:
HADOOP WORKER 1 to 5 (each): 8 vCPU & 52G RAM; NM, PXF, HS 1 to HS 4; 4 x 200G
ZK is also shown on HADOOP WORKER 3, 4 and 5
HADOOP MASTER 1: 8 vCPU & 48G RAM; Active RM, HAWQ Master Standby, Hive Server2, Hive Metastore
HADOOP MASTER 2: 8 vCPU & 48G RAM; HAWQ Master, Standby RM, History Server, App Timeline Server
HADOOP CLIENT: 4 vCPU & 36G RAM; Clients (HCat, HDFS, Hive, MapReduce2, Pig, Tez, YARN, ZooKeeper)
MANAGEMENT: 4 vCPU & 12G RAM; Nagios, Ambari, Ganglia
ALPINE DATA LAB (PROD): 8 vCPU & 48G RAM; 500G
ALPINE DATA LAB (STAGE): 8 vCPU & 48G RAM; 500G
Disks shown for the master, client and management VMs: 180G, 180G, 180G, 200G
Other components shown: Spring XD Admin, Postgres, RabbitMQ, HUE, Hive MySQL, WebHCat Server, SC
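The aggregate worker capacity implied by the architecture slide can be tallied in a few lines of Python. This is only a sketch: the figures are copied from the slide (5 workers, each with 8 vCPU, 52G RAM, 4 HAWQ segments and 4 x 200G VMDKs), while the constant and function names are made up for illustration.

```python
# Worker VM figures copied from the architecture slide.
WORKERS = 5
VCPU_PER_WORKER = 8
RAM_GB_PER_WORKER = 52
VMDKS_PER_WORKER = 4          # one per HAWQ segment / Mapred local directory
VMDK_SIZE_GB = 200
HAWQ_SEGMENTS_PER_WORKER = 4


def worker_totals():
    """Return aggregate worker resources as a dict (hypothetical helper)."""
    return {
        "vcpu": WORKERS * VCPU_PER_WORKER,
        "ram_gb": WORKERS * RAM_GB_PER_WORKER,
        "temp_disk_gb": WORKERS * VMDKS_PER_WORKER * VMDK_SIZE_GB,
        "hawq_segments": WORKERS * HAWQ_SEGMENTS_PER_WORKER,
    }


if __name__ == "__main__":
    for name, value in worker_totals().items():
        print(f"{name}: {value}")
```

Under these figures the workers together provide 40 vCPU, 260G RAM, 4000G of temp disk and 20 HAWQ segments. The master, client, management and Alpine VMs are not included in the tally.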

5
On-Prem Big Data Production Datacenter

6
Apache Ambari – The Hadoop Cluster Management Console
Management & Monitoring
- HDFS
- YARN/MapReduce
- Hive
- HAWQ
- Spring XD

Data Processing Pipeline – Click Stream Data
[Pipeline diagram elements: Clickstream raw data files on ftps.vmware.com; firewall; daily push of Clickstream Logs; Data Ingestion to Isilon HDFS via Spring XD; Lookup Logs; Clickstream Logs; python; Adv. Analytics Users]
• Data Cleaning
• Better Consumable Structured data
• Data Partitioning
• Schema Building
• Faster Analytic Power
- About 2M Clickstream Records (~10GB) are ingested daily from Adobe Omniture to Isilon HDFS
- 1.5 billion records, 554 columns and ~20TB of data
- Data Cleanup and Preprocessing using Pig, Hadoop Streaming and Python Scripts
- Fit the Data into the Hive/HAWQ Schema
- End Users (Data Scientists) consume via HUE/pgAdmin/Alpine Data Lab
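The deck does not show the cleanup scripts themselves. As a rough sketch of what a Hadoop Streaming cleanup step in Python could look like, the mapper logic below drops malformed rows and trims whitespace. It assumes tab-delimited raw records and uses the 554-column count quoted on the slide. The delimiter, the cleaning rule and the function names are assumptions for illustration, not the author's code.

```python
EXPECTED_COLUMNS = 554  # column count quoted on the slide
DELIMITER = "\t"        # assumption: raw records are tab-delimited


def clean_record(line):
    """Return a cleaned line, or None if the record is malformed."""
    fields = line.rstrip("\r\n").split(DELIMITER)
    if len(fields) != EXPECTED_COLUMNS:
        return None  # drop rows with the wrong number of columns
    return DELIMITER.join(field.strip() for field in fields)


def run_mapper(lines, out):
    """Streaming-style mapper loop: read raw lines, write cleaned ones."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned is not None:
            out.write(cleaned + "\n")
```

In a streaming job, the script would end with a call such as `run_mapper(sys.stdin, sys.stdout)` after `import sys`, since Hadoop Streaming feeds records to the mapper on standard input and collects its standard output.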

8
Data Consumption – pgAdmin3 (via HAWQ Database)

9
And visualize the results ..
[Pie chart: Top 20 Countries with unique vmware.com Visits on 2015 Q1; slice labels range from 37% down to 1%]
[Pie chart: Top 20 Countries with unique vmware.com Visitors on 2015 Q1; slice labels range from 34% down to 1%]
Legend: usa, jpn, deu, gbr, chn, ind, can, fra, aus, kor, esp, bra, ita, nld, rus, che, twn, pol, mex, swe
Disclaimer: This is based on a synthesized dataset for demo purposes, not real data
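The per-country shares shown in the pie charts could be derived along the following lines. This is a minimal sketch, assuming the input is a stream of (country code, visitor id) pairs. The function name, the input shape and the sample rows are hypothetical and are not taken from the deck.

```python
from collections import defaultdict


def unique_share_by_country(rows, top_n=20):
    """rows: iterable of (country_code, visitor_id) pairs.

    Returns a list of (country_code, percent_of_unique_visitors),
    largest share first, limited to top_n countries.
    """
    visitors = defaultdict(set)
    for country, visitor_id in rows:
        visitors[country].add(visitor_id)
    # A visitor seen in two countries is counted once per country.
    total = sum(len(ids) for ids in visitors.values())
    if total == 0:
        return []
    ranked = sorted(visitors.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(country, 100.0 * len(ids) / total) for country, ids in ranked[:top_n]]


# Made-up sample rows, not data from the deck.
sample = [("usa", "v1"), ("usa", "v2"), ("usa", "v1"), ("jpn", "v3"), ("deu", "v4")]
print(unique_share_by_country(sample))
```

Ties are broken alphabetically by country code so that the ordering is deterministic.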

10
Data Consumption – HUE
Hive Query to find out unique visits to the VMware site in 2015 Q1
[Chart: Unique Visits in 2014 and 2015, month-wise; x-axis: Month; y-axis: VisitCount (0 to 14,000,000); series: visits]
Disclaimer: This is based on a synthesized dataset for demo purposes, not real data
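The Hive query itself is not reproduced in the deck. The aggregation it describes, counting distinct visits per month, can be sketched in plain Python as follows. The input shape (a visit date paired with a visit id), the function name and the sample hits are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date


def unique_visits_per_month(hits):
    """hits: iterable of (visit_date, visit_id) pairs.

    Returns {(year, month): number of distinct visit ids}.
    """
    seen = defaultdict(set)
    for visit_date, visit_id in hits:
        seen[(visit_date.year, visit_date.month)].add(visit_id)
    return {month: len(ids) for month, ids in sorted(seen.items())}


# Made-up sample hits, not data from the deck.
hits = [
    (date(2015, 1, 5), "a"),
    (date(2015, 1, 5), "a"),   # same visit, second page view
    (date(2015, 1, 20), "b"),
    (date(2015, 2, 1), "c"),
]
print(unique_visits_per_month(hits))
```

Repeated page views within the same visit collapse into one entry because each month keeps a set of visit ids rather than a running count.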

11
Advanced Data Analytics by Alpine Data Lab
Time Series Analysis on Jan 2015
Clickstream Data

12
Take Away …
At VMware IT, we have shown that an Enterprise Big Data Analytics Platform can be successfully built and run on top of VMware Virtual Infrastructure with EMC Isilon and PHD 3.0, with great performance.
