© 2014 VMware Inc. All rights reserved.
Virtualized Big Data Platform @ VMware Corp IT
Rajit Saha
Hadoop Development Lead
VMware Corp IT Data Solution and Delivery
An Enterprise Data Warehouse meets an Elephant
2
Business Use Case for Big Data Analytics @ VMware BI Space
Personalized Marketing & Customer Targeting
Personalized Campaign Content Strategy
MyVMware Log Analytics
Combine user-level data (logins and other activities) with Clickstream Data and Product Data
VMware Product List Price Optimization and Deal Analytics for VMware Pricing Team
- Complex ETL, Bigger Joins
- Flattening Star Schema Tables
- Propensity Modeling
EDW
- Deeper Learning of VMware Product Issues
- Build a highly intelligent recommendation system to fix Customer Issues with faster turnaround time
GSS Service Request Logs Analytics
- High Volume ~ 400TB
- Wide variety of data
- Complex parsing
Clickstream Data Analytics
• Path analysis – First user visit to buy product
• Propensity Modeling
• Predictive Analytics - which product user will buy
• Customer Lifetime Value Analysis
554 columns
1.5B Rows
20TB Data (2 yrs)
Variety
Volume
Velocity
BIG DATA
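As a rough sanity check on the clickstream figures above (554 columns, 1.5B rows, 20TB over 2 years), the sketch below derives the average row size and the average daily row count. It assumes decimal units (1 TB = 1e12 bytes) and a 730-day window; neither assumption is stated on the slide.

```python
# Back-of-the-envelope sizing from the clickstream figures on the slide.
# Assumptions (not from the slide): decimal units and a 730-day window.

COLUMNS = 554                 # "554 columns"
TOTAL_ROWS = 1_500_000_000    # "1.5B Rows"
TOTAL_BYTES = 20e12           # "20TB Data (2 yrs)"
DAYS = 2 * 365                # two years, ignoring leap days

# Average stored size of one row, in bytes.
avg_row_bytes = TOTAL_BYTES / TOTAL_ROWS

# Average size of a single column value within a row, in bytes.
avg_field_bytes = avg_row_bytes / COLUMNS

# Average number of rows landed per day over the two years.
rows_per_day = TOTAL_ROWS / DAYS

print(f"average bytes per row:   {avg_row_bytes:,.0f}")
print(f"average bytes per field: {avg_field_bytes:.1f}")
print(f"average rows per day:    {rows_per_day:,.0f}")
```

Under these assumptions a row averages roughly 13 KB and the table grows by about two million rows per day.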
3
Components of Big Data Cluster
• This Big Data Cluster is fully virtualized, based on vSphere 6.0 and VMware Big Data Extensions 2.2
• We used EMC Isilon 7.2.0.2 with two patches for HDFS Storage
• We used Pivotal Big Data Suite 3.0 for Hadoop 2.6 and HAWQ 1.3
• We used Pivotal Spring XD 1.2 for Data Ingestion to Hadoop
• We integrated this with Alpine Data Lab 5.4 for running
  - Deeper Analytic Functions
  - Machine Learning Algorithms
• We integrated HUE 2.6 as the GUI-based HIVE/PIG query execution client
4
VMware Corp IT Big Data Analytic Platform [Production] – Application Architecture Stack

Legend:
HS1 to HS4 = HAWQ Segment 1 to 4
PXF = Pivotal Extension Framework
ZK = Zookeeper Server
SC = Spring XD Container
NM = YARN Node Manager

Storage and network:
· HDFS on Isilon (five Isilon nodes shown), reached over a 10G Data Link
· NAS Shared Storage [HadoopTempSpace]

Platform notes:
· VMware Big Data Extension 2.2 - provisions Hadoop VMs
· Application Stack: Pivotal HD (PHD 3.0), Spring XD 1.2, RabbitMQ 3.5.3, PostGres 9.4
· Analytics Tool: Alpine Data Lab 5.4
· HDFS Storage: EMC Isilon 7.2.0.2 + Restricted Patch-14925
· HDFS Capacity: 30T
· Temp Storage on: vmdks on VNX NAS
· 4 HAWQ Segments & 4 Mapred Local Directories will be mounted on 4 VMDKs on NFS in Worker VMs

Virtual machines:
HADOOP WORKER 1 to 5 (each): 8 vCPU & 52G RAM; NM, HS1 to HS4, PXF; 4 x 200G; ZK also on Workers 3, 4 and 5
HADOOP MASTER 1: 8 vCPU & 48G RAM; 180G; Active RM, HAWQ Master Standby, Hive Server2, Hive-Metastore
HADOOP MASTER 2: 8 vCPU & 48G RAM; 180G; HAWQ Master, Standby RM, History Server, App Timeline Server
HADOOP CLIENT: 4 vCPU & 36G RAM; 180G; Clients (HCat, HDFS, Hive, MapReduce2, Pig, Tez, YARN, ZooKeeper), Spring XD Admin, POSTGRES, RABBITMQ
MANAGEMENT: 4 vCPU & 12G RAM; 200G; NAGIOS, AMBARI, GANGLIA
ALPINE DATA LAB (PROD): 8 vCPU & 48G RAM; 500G
ALPINE DATA LAB (STAGE): 8 vCPU & 48G RAM; 500G
Also shown in the diagram: HUE, Hive MySQL, WebHCat Server, and two SC boxes
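To make the scale of the architecture slide easier to see, the sketch below totals the vCPU and RAM figures it quotes. The VM counts and sizes are transcribed from the slide. The role names, the `VmSpec` type, and the reading that every worker carries four HAWQ segments and four 200G temp VMDKs are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    role: str      # hypothetical label, not a name used in the deck
    count: int
    vcpu: int
    ram_gb: int


# Sizes transcribed from the architecture slide.
INVENTORY = [
    VmSpec("hadoop-worker", 5, 8, 52),
    VmSpec("hadoop-master", 2, 8, 48),
    VmSpec("hadoop-client", 1, 4, 36),
    VmSpec("management", 1, 4, 12),
    VmSpec("alpine-data-lab", 2, 8, 48),  # PROD and STAGE
]

# Assumed per-worker layout: HS1..HS4, each on its own 200G VMDK.
HAWQ_SEGMENTS_PER_WORKER = 4
TEMP_VMDKS_PER_WORKER = 4
TEMP_VMDK_GB = 200


def totals(inventory):
    """Return (total vCPU, total RAM in GB) across all VMs."""
    vcpu = sum(vm.count * vm.vcpu for vm in inventory)
    ram_gb = sum(vm.count * vm.ram_gb for vm in inventory)
    return vcpu, ram_gb


workers = next(vm for vm in INVENTORY if vm.role == "hadoop-worker")
total_vcpu, total_ram_gb = totals(INVENTORY)
total_segments = workers.count * HAWQ_SEGMENTS_PER_WORKER
worker_temp_gb = workers.count * TEMP_VMDKS_PER_WORKER * TEMP_VMDK_GB

print(total_vcpu, total_ram_gb, total_segments, worker_temp_gb)
```

On this reading the platform uses 80 vCPU and 500G of RAM in total, with 20 HAWQ segments and 4000G of worker temp space.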
5
On-Prem Big Data Production Datacenter
6
Apache Ambari
– The Hadoop Cluster Management Console
Management & Monitoring
- HDFS
- YARN/MapReduce
- Hive
- HAWQ
- Spring XD
7
Data Processing Pipeline – Click Stream Data
[Diagram: Clickstream raw data files (ftps.vmware.com) -> firewall -> daily push of Clickstream Logs -> Data Ingestion to Isilon HDFS via Spring XD (Lookup Logs, Clickstream Logs) -> python -> Adv. Analytics Users]
• Data Cleaning
• Better Consumable Structured data
• Data Partitioning
• Schema Building
• Faster Analytic Power
- Daily 2M Clickstream Records (~10GB) are being ingested from Adobe Omniture to Isilon HDFS
- 1.5 billion records, 554 columns, and ~20TB of data
- Data Cleanup and Pre-Processing using PIG, Hadoop Streaming and Python Scripts
- Fit the Data into HIVE/HAWQ Schema
- End Users (Data Scientists) consume via HUE/pgAdmin/Alpine Data Lab
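The cleanup step with Hadoop Streaming and Python could look roughly like the mapper below. This is a minimal sketch, not the scripts used at VMware: it assumes tab-delimited input, drops rows whose column count differs from the 554 quoted on the slides, trims whitespace, and writes Hive's default NULL marker for empty fields. The function names are hypothetical.

```python
import io

EXPECTED_COLUMNS = 554   # column count quoted on the slides
HIVE_NULL = r"\N"        # Hive's default NULL marker for text tables


def clean_line(line, expected_columns=EXPECTED_COLUMNS):
    """Return a cleaned tab-delimited record, or None if it is malformed."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != expected_columns:
        return None
    # Trim whitespace; empty fields become the Hive NULL marker.
    cleaned = [field.strip() or HIVE_NULL for field in fields]
    return "\t".join(cleaned)


def run_mapper(stream_in, stream_out, expected_columns=EXPECTED_COLUMNS):
    """Streaming-style mapper: read raw lines, write only cleaned records."""
    kept = dropped = 0
    for line in stream_in:
        cleaned = clean_line(line, expected_columns)
        if cleaned is None:
            dropped += 1
            continue
        stream_out.write(cleaned + "\n")
        kept += 1
    return kept, dropped


# Tiny demo with 3 columns instead of 554.
raw = io.StringIO("u1\t /home \tusa\nbroken row\nu2\t\tjpn\n")
out = io.StringIO()
kept, dropped = run_mapper(raw, out, expected_columns=3)
print(kept, dropped)
print(out.getvalue(), end="")
```

Under Hadoop Streaming the same logic would be driven by `run_mapper(sys.stdin, sys.stdout)`, since the framework passes records to the mapper on standard input and collects its standard output.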
8
Data Consumption – pgAdmin3 (via HAWQ Database)
9
And visualize the results ..
[Pie chart: Top 20 Countries with unique vmware.com Visits on 2015 Q1; largest slice 37%]
[Pie chart: Top 20 Countries with unique vmware.com Visitors on 2015 Q1; largest slice 34%]
Legend: usa, jpn, deu, gbr, chn, ind, can, fra, aus, kor, esp, bra, ita, nld, rus, che, twn, pol, mex, swe
Disclaimer : This is based on Synthesized Dataset for demo purpose, not
Real Data
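The percentages in charts like these can be derived from per-country counts. The sketch below is illustrative only: the function name and the sample data are made up, and it assumes each input item is the country code of one unique visit, with shares computed relative to the top-N total as in a pie chart.

```python
from collections import Counter


def top_country_shares(country_codes, top_n=20):
    """Return [(country, percent)] for the top_n countries by count.

    Percentages are relative to the total of the top_n countries, so
    they sum to about 100, as the slices of a pie chart do.
    """
    counts = Counter(country_codes)
    top = counts.most_common(top_n)
    total = sum(n for _, n in top)
    return [(country, round(100 * n / total)) for country, n in top]


# Made-up sample: one country code per unique visit.
sample = ["usa"] * 6 + ["jpn"] * 2 + ["deu"] * 2
shares = top_country_shares(sample, top_n=3)
print(shares)
```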
10
Data Consumption – HUE
Hive Query to find out unique visits in VMware site 2015 Q1
[Chart: Unique Visits in 2014 and 2015 month wise; x-axis: Month; y-axis: VisitCount (0 to 14,000,000); series: visits]
Disclaimer : This is based on Synthesized Dataset for demo purpose, not
Real Data
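The slide does not show the Hive query itself. As a hedged illustration of the same aggregation (a distinct count of visits grouped by month), here is a small Python equivalent. The record layout of (visit id, visit date) and the function name are assumptions, not taken from the deck.

```python
from collections import defaultdict
from datetime import date


def unique_visits_by_month(records):
    """Count distinct visit ids per (year, month).

    records: iterable of (visit_id, visit_date) tuples. Repeated rows for
    the same visit in the same month are counted once.
    """
    seen = defaultdict(set)
    for visit_id, visit_date in records:
        seen[(visit_date.year, visit_date.month)].add(visit_id)
    return {month: len(ids) for month, ids in sorted(seen.items())}


# Made-up sample rows.
sample = [
    ("v1", date(2015, 1, 3)),
    ("v1", date(2015, 1, 3)),   # same visit, second page view
    ("v2", date(2015, 1, 9)),
    ("v3", date(2015, 2, 1)),
]
monthly = unique_visits_by_month(sample)
print(monthly)
```

At the volumes quoted in this deck the aggregation would run in Hive or HAWQ rather than in a single Python process; the sketch only shows the logic.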
11
Advanced Data Analytics by Alpine Data Lab
Time Series Analysis on Jan 2015 Clickstream Data
12
Take Away …
At VMware IT, we have established that an Enterprise Big Data Analytics Platform can be successfully built and run on top of VMware Virtual Infrastructure with EMC Isilon and PHD 3.0, with great performance.
13
Thank You
Q&A

More Related Content

What's hot

Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityDataWorks Summit
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache DruidImply
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid Matt Sarrel
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospectc-bslim
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
 
Hybrid solutions – combining in memory solutions with SSD - Christos Erotocritou
Hybrid solutions – combining in memory solutions with SSD - Christos ErotocritouHybrid solutions – combining in memory solutions with SSD - Christos Erotocritou
Hybrid solutions – combining in memory solutions with SSD - Christos ErotocritouJAXLondon_Conference
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionMurtaza Doctor
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidCharles Allen
 
NoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache CalciteNoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache Calcitegianmerlino
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium confluent
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid Imply
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleVasu S
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidImply
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Databricks
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormDataStax
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop EcosystemSlim Bouguerra
 

What's hot (20)

Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data Security
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache Druid
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospect
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
 
Hybrid solutions – combining in memory solutions with SSD - Christos Erotocritou
Hybrid solutions – combining in memory solutions with SSD - Christos ErotocritouHybrid solutions – combining in memory solutions with SSD - Christos Erotocritou
Hybrid solutions – combining in memory solutions with SSD - Christos Erotocritou
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
NoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache CalciteNoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache Calcite
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
 
Druid
DruidDruid
Druid
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 

Viewers also liked

WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusterst_ivanov
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoopChiou-Nan Chen
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld
 
Soyez Big Data ready avec Isilon
Soyez Big Data ready avec IsilonSoyez Big Data ready avec Isilon
Soyez Big Data ready avec IsilonRSD
 
7. emc isilon hdfs enterprise storage for hadoop
7. emc isilon hdfs   enterprise storage for hadoop7. emc isilon hdfs   enterprise storage for hadoop
7. emc isilon hdfs enterprise storage for hadoopTaldor Group
 
Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop InnoTech
 
EMC Hadoop Starter Kit
EMC Hadoop Starter KitEMC Hadoop Starter Kit
EMC Hadoop Starter KitEMC
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastuctureDataWorks Summit
 
Gartner IT Symposium 2014 - VMware Cloud Services
Gartner IT Symposium 2014 - VMware Cloud ServicesGartner IT Symposium 2014 - VMware Cloud Services
Gartner IT Symposium 2014 - VMware Cloud ServicesPhilip Say
 
VMworld - vSphere Distributed Switch 6.0 Technical Deep Dive
VMworld - vSphere Distributed Switch 6.0 Technical Deep DiveVMworld - vSphere Distributed Switch 6.0 Technical Deep Dive
VMworld - vSphere Distributed Switch 6.0 Technical Deep DiveChris Wahl
 
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...Nati Shalom
 
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...EMC
 
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...EMC
 

Viewers also liked (15)

WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
 
Soyez Big Data ready avec Isilon
Soyez Big Data ready avec IsilonSoyez Big Data ready avec Isilon
Soyez Big Data ready avec Isilon
 
7. emc isilon hdfs enterprise storage for hadoop
7. emc isilon hdfs   enterprise storage for hadoop7. emc isilon hdfs   enterprise storage for hadoop
7. emc isilon hdfs enterprise storage for hadoop
 
Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop
 
EMC Hadoop Starter Kit
EMC Hadoop Starter KitEMC Hadoop Starter Kit
EMC Hadoop Starter Kit
 
EMC config Hadoop
EMC config HadoopEMC config Hadoop
EMC config Hadoop
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastucture
 
Gartner IT Symposium 2014 - VMware Cloud Services
Gartner IT Symposium 2014 - VMware Cloud ServicesGartner IT Symposium 2014 - VMware Cloud Services
Gartner IT Symposium 2014 - VMware Cloud Services
 
VMworld - vSphere Distributed Switch 6.0 Technical Deep Dive
VMworld - vSphere Distributed Switch 6.0 Technical Deep DiveVMworld - vSphere Distributed Switch 6.0 Technical Deep Dive
VMworld - vSphere Distributed Switch 6.0 Technical Deep Dive
 
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...
 
Cloud Management with vRealize Operations
Cloud Management with vRealize OperationsCloud Management with vRealize Operations
Cloud Management with vRealize Operations
 
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
 
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...
 

Similar to Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld
 
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...VMworld
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?Glenn Renfro
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanJim Kaskade
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFSUSE Italy
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big datasolarisyourep
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big dataxKinAnx
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSStéphane Fréchette
 
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013Amazon Web Services
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationVlad Ponomarev
 
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...nnakasone
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
 
Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudAlluxio, Inc.
 
Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Ontico
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Continuent
 

Similar to Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015 (20)

Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
 
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?Big Data Applications Made Easy: Fact Or Fiction?
Big Data Applications Made Easy: Fact Or Fiction?
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps Ironfan
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
 
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the Cloud
 
Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...
 

Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015

  • 1. Virtualized Big Data Platform @ VMware Corp IT: An Enterprise Data Warehouse meets an Elephant. Rajit Saha, Hadoop Development Lead, VMware Corp IT Data Solution and Delivery. © 2014 VMware Inc. All rights reserved.
  • 2. Business Use Case for Big Data Analytics @ VMware BI Space
      - Personalized Marketing & Customer Targeting: Personalized Campaign Content Strategy
      - MyVMware Log Analytics: combine user-level data (logins and other activities) with Clickstream Data and Product Data
      - VMware Product List Price Optimization and Deal Analytics for the VMware Pricing Team: complex ETL, bigger joins; flattening star schema tables; propensity modeling
      - GSS Service Request Logs Analytics: deeper learning of VMware product issues; build a highly intelligent recommendation system to fix customer issues with faster turnaround time; high volume (~400TB); a wide variety of data; complex parsing
      - Clickstream Data Analytics: path analysis (first user visit to buy product); propensity modeling; predictive analytics (which product a user will buy); customer lifetime value analysis; 554 columns, 1.5B rows, 20TB of data (2 yrs)
      - Diagram labels: EDW; BIG DATA (Variety, Volume, Velocity)
  • 3. Components of Big Data Cluster
      - This Big Data Cluster is fully virtualized, based on vSphere 6.0 and VMware Big Data Extensions 2.2
      - We used EMC Isilon 7.2.0.2 with two patches for HDFS storage
      - We used Pivotal Big Data Suite 3.0 for Hadoop 2.6 and HAWQ 1.3
      - We used Pivotal Spring XD 1.2 for data ingestion to Hadoop
      - We integrated this with Alpine Data Lab 5.4 for running deeper analytic functions and machine learning algorithms
      - We integrated HUE 2.6 as a GUI-based Hive/Pig query execution client
  • 4. VMware Corp IT Big Data Analytic Platform [Production] – Application Architecture Stack
      Legend: HS1 to HS4 = HAWQ Segments 1 to 4; PXF = Pivotal Extension Framework; ZK = ZooKeeper Server; SC = Spring XD Container; NM = YARN NodeManager. Storage: Isilon HDFS over a 10G data link; NAS shared storage [HadoopTempSpace].
      - VMware Big Data Extensions 2.2 provisions Hadoop VMs
      - Application stack: Pivotal HD (PHD 3.0), Spring XD 1.2, RabbitMQ 3.5.3, Postgres 9.4
      - Analytics tool: Alpine Data Lab 5.4
      - HDFS storage: EMC Isilon 7.2.0.2 + Restricted Patch-14925
      - HDFS capacity: 30T
      - Temp storage: VMDKs on VNX NAS
      - 4 HAWQ Segments and 4 MapReduce local directories will be mounted on 4 VMDKs on NFS in the Worker VMs
      VMs:
      - HADOOP WORKER 1 and 2 (8 vCPU, 52G RAM each): NM, HS 1 to HS 4, PXF; 4 x 200G
      - HADOOP WORKER 3, 4, and 5 (8 vCPU, 52G RAM each): NM, HS 1 to HS 4, PXF, ZK; 4 x 200G
      - HADOOP MASTER 1 (8 vCPU, 48G RAM): Active RM, HAWQ Master Standby, Hive Server2, Hive Metastore
      - HADOOP MASTER 2 (8 vCPU, 48G RAM): HAWQ Master, Standby RM, History Server, App Timeline Server
      - HADOOP CLIENT (4 vCPU, 36G RAM): Clients (HCat, HDFS, Hive, MapReduce2, Pig, Tez, YARN, ZooKeeper), Spring XD Admin, Postgres, RabbitMQ
      - MANAGEMENT (4 vCPU, 12G RAM): Nagios, Ambari, Ganglia
      - ALPINE DATA LAB (PROD) (8 vCPU, 48G RAM): 500G
      - ALPINE DATA LAB (STAGE) (8 vCPU, 48G RAM): 500G
      - Disk labels shown alongside the master, client, and management VMs: 180G, 180G, 180G, 200G
      - Other component labels in the diagram: HUE, Hive MySQL, WebHCat Server, SC (x2)
  • 5. On-Prem Big Data Production Datacenter
  • 6. Apache Ambari – The Hadoop Cluster Management Console. Management & Monitoring of: HDFS, YARN/MapReduce, Hive, HAWQ, Spring XD
  • 7. Data Processing Pipeline – Click Stream Data
      Diagram: Clickstream raw data files (ftps.vmware.com) -> firewall -> daily push of Clickstream Logs -> data ingestion to Isilon HDFS via Spring XD (Lookup Logs, Clickstream Logs) -> Adv. Analytics Users
      - Processing goals: data cleaning; better consumable structured data; data partitioning; schema building; faster analytic power
      - Daily, 2M Clickstream records (~10GB) are ingested from Adobe Omniture to Isilon HDFS
      - 1.5 billion records, 554 columns, and ~20TB of data
      - Data cleanup and preprocessing using Pig, Hadoop Streaming, and Python scripts
      - Fit the data into the Hive/HAWQ schema
      - End users (data scientists) consume via HUE/pgAdmin/Alpine Data Lab
  • 8. Data Consumption – pgAdmin3 (via HAWQ Database)
  • 9. And visualize the results. [Pie chart: Top 20 Countries with unique vmware.com Visits on 2015 Q1] [Pie chart: Top 20 Countries with unique vmware.com Visitors on 2015 Q1] Legend: usa, jpn, deu, gbr, chn, ind, can, fra, aus, kor, esp, bra, ita, nld, rus, che, twn, pol, mex, swe. Disclaimer: This is based on a synthesized dataset for demo purposes, not real data.
  • 10. Data Consumption – HUE: Hive query to find unique visits to the VMware site in 2015 Q1. [Chart: Unique Visits in 2014 and 2015, month-wise visits; x-axis: Month, y-axis: VisitCount] Disclaimer: This is based on a synthesized dataset for demo purposes, not real data.
  • 11. Advanced Data Analytics by Alpine Data Lab: Time Series Analysis on Jan 2015 Clickstream Data
  • 12. Take Away: At VMware IT, we have established that an Enterprise Big Data Analytics Platform can be successfully built and run on top of VMware Virtual Infrastructure with EMC Isilon and PHD 3.0, with great performance.
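
The cleanup step on slide 7 (Python scripts alongside Pig and Hadoop Streaming) and the unique-visits query on slide 10 might look roughly like the sketch below. It is not the code used at VMware: the tab-separated layout, the four field names, and the timestamp format are all assumptions made for illustration (the real feed has 554 columns), and the per-month count is done in plain Python rather than Hive.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical 4-column layout; the feed described in the slides has 554 columns.
FIELDS = ("visit_id", "visitor_id", "timestamp", "country")


def clean_record(line):
    """Parse one tab-separated line; return a dict, or None if the line is malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None  # wrong column count
    record = {name: value.strip() for name, value in zip(FIELDS, parts)}
    if not record["visit_id"]:
        return None  # no visit identifier to count
    try:
        # Assumed timestamp format; adjust to the real feed.
        ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # unparseable timestamp
    record["month"] = ts.strftime("%Y-%m")  # candidate partition key
    record["country"] = record["country"].lower()
    return record


def unique_visits_by_month(lines):
    """Count distinct visit_id values per month over the cleaned records."""
    visits = defaultdict(set)
    for line in lines:
        record = clean_record(line)
        if record is not None:
            visits[record["month"]].add(record["visit_id"])
    return {month: len(ids) for month, ids in sorted(visits.items())}
```

Under Hadoop Streaming, a mapper built on `clean_record` would read its lines from `sys.stdin` and write the cleaned fields to standard output; here `unique_visits_by_month` accepts any iterable of lines so the logic can be tried on a small local sample first.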