Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights

Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. Managing such a large deployment takes special setup and skills. This session shares how we set it up and how you will manage it.

After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.
Objective 2: Understand how to set up a large Hadoop deployment and manage its day-to-day challenges.
Objective 3: Identify the tools needed to solve the challenges of managing a large Hadoop cluster.

Published in: Technology

Transcript of "Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights"

  1. Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights
     SK Krishnamurthy, skrishnamurthy@gopivotal.com
     © Copyright 2013 EMC Corporation. All rights reserved.
  2. Traditional Enterprise Analytics Process
  3. The Fundamental Paradigm Shift
     • Internet age and exploding data growth
     • Enterprises leverage new data sources to identify emerging trends and opportunities
     • Traditional database tools not able to cope
  4. Enter Hadoop: Platform for Big Data
     • Flexible
     • Scalable
     • Inexpensive
     • Fault-tolerant
     • Rapidly Adopted
  5. Evolution of Process with Hadoop
  6. HDFS Economics Have Changed the Game
     [Chart: Big Data Platform Price/TB, $0 to $80,000, 2008 to 2013; series: Big Data DB, Hadoop]
     • Big Data RDBMS pricing will ultimately converge with Hadoop pricing.
     • The price per TB of Big Data RDBMS has been consistently eroding over time.
     • Hadoop pricing has increased slightly over time as vendors have injected value-added services into the ecosystem.
  7. Where We’re Going
     © Copyright 2013 Pivotal. All rights reserved.
  8. Big Data Platform / Pivotal Data Platform
     [Diagram labels: Stream Ingestion, Streaming Services, Data Staging Platform, Data Mgmt. Services, Operational Intelligence, Run-Time Applications, In-Memory DB, Analytical Query, In-Memory Objects, HDFS]
     Enterprise Data Warehouse (RDBMS): continues to serve as system of record
     • Traditional BI/Reporting
     • Data Visualization
     • Compliance and financial reporting
  9. Flexible Deployment Model
     Portable, Elastic, HW Abstracted, Manageable, “Consumer” grade
     Deploy: Private Cloud, On Premise, Public Cloud
  10. PIVOTAL HD
      The world’s most powerful Hadoop distribution
  11. Pivotal HD
      • World’s first true SQL processing for enterprise-ready Hadoop
      • 100% Apache Hadoop-based platform
      • Virtualization and cloud ready with VMware and Isilon
      • Scale tested in the 1,000-node Pivotal Analytics Workbench
      • Available as a software-only or appliance-based solution
      • Backed by EMC’s global, 24x7 support infrastructure
  12. Pivotal Hadoop Distributions
      • GPHD: Apache Hadoop 1.x
      • Pivotal HD: Apache Hadoop 2.x
      • 100% Open Source Compatible
  13. Pivotal HD Components
      • HDFS: the Hadoop Distributed File System acts as the storage layer for Hadoop
      • Pig: high-level procedural language for data pipeline/data flow processing in Hadoop
      • MapReduce: parallel processing framework used for data computation in Hadoop
      • HBase: NoSQL, key-value data store on top of HDFS
      • Hive: structured data warehouse implementation for data in HDFS that provides a SQL-like interface to Hadoop
      • Mahout: library of scalable machine-learning algorithms
      • Spring Hadoop: integrates the Spring framework into Hadoop
  14. Pivotal HD Value-Added Components
      GPHD includes:
      • Installation and Configuration Manager (ICM): cluster installation, upgrade, and expansion tools.
      • GP Command Center: visual interface for cluster health, system metrics, and job monitoring.
      • Hadoop Virtualization Extension (HVE): enhances Hadoop to support virtual node awareness and enables greater cluster elasticity.
      • GP Data Loader: parallel loading infrastructure that supports “line speed” data loading into HDFS.
      Pivotal HD adds the following to GPHD:
      • Advanced Database Services (HAWQ): high-performance, “True SQL” query interface running within the Hadoop cluster.
      • Extensions Framework (GPXF): support for HAWQ interfaces on external data providers (HBase, Avro, etc.).
      • Advanced Analytics Functions (MADlib): ability to access parallelized machine-learning and data-mining functions at scale.
      • Isilon Integration: extensively tested at scale with guidelines for compute-heavy, storage-heavy, and balanced configurations.
  15. Pivotal Core Components & Versions
      Component       GPHD 1.2 Core Distribution   Pivotal HD Enterprise
      Hadoop          1.0.3                        2.0.2
      HBase           0.92.1                       0.94.2
      Hive            0.8.1                        0.9.1
      Mahout          0.6                          0.8.0
      Pig             0.9.2                        0.10.0
      Zookeeper       3.3.5                        3.4.3
      Flume           1.2.0                        1.2.0
      Sqoop           1.4.1                        1.4.1
      Spring Hadoop   (no version listed)          (no version listed)
  16. Pivotal HD Architecture
      [Diagram labels, Apache layer: Resource Management & Workflow; Pig, Hive, Mahout; HBase; Map Reduce; Yarn; HDFS; Zookeeper; Sqoop; Flume]
  17. Pivotal HD Architecture
      [Diagram labels, Apache layer: Resource Management & Workflow; Pig, Hive, Mahout; HBase; Map Reduce; Yarn; HDFS; Zookeeper; Sqoop; Flume]
      [Diagram labels, Pivotal HD Enterprise layer: Hadoop Virtualization (HVE); Data Loader; Command Center; Deploy, Configure, Monitor, Manage]
  18. Pivotal HD Architecture
      [Diagram labels, Apache layer: Resource Management & Workflow; Pig, Hive, Mahout; HBase; Map Reduce; Yarn; HDFS; Zookeeper; Sqoop; Flume]
      [Diagram labels, Pivotal HD Enterprise layer: Hadoop Virtualization (HVE); Data Loader; Command Center; Deploy, Configure, Monitor, Manage]
      [Diagram labels, HAWQ layer: HAWQ: Advanced Database Services; ANSI SQL + Analytics; Xtension Framework; Query Optimizer; Dynamic Pipelining; Catalog Services]
  19. DataLoader
      [Diagram labels: Streams; DataLoader; Pull; Push; Web GUI and CLI; Connectors; Flume; Files; Data Source Registration; Job Management; Data Destination Registration; Copy Strategy Optimization; Data Processing; Data Copy; HDFS; HDFS; NFS; HTTP; FTP; Local; REST APIs]
  20. Command Center
      Simple and complete cluster management
      • Install and configure Hadoop components and services
      • Centralized interface for Pivotal HD cluster monitoring, diagnostics, and management
      • Live and historical Hadoop system metrics analysis
      [Diagram labels: Deploy, Configure, Analyze, Monitor, Manage]
  21. Command Center: Monitor, Manage, and Analyze
      • Host-, application-, and job-level performance monitoring across the entire Pivotal HD cluster
      • Visualize and analyze live and historical Hadoop cluster information through the Command Center Dashboard
      • Quick diagnostics of functional or performance issues
  22. Hadoop Virtualization Extensions (HVE)
      • HVE enables Hadoop to support more effective virtual deployments
      • This creates the opportunity to provision and scale the compute and storage processes independently, resulting in:
        • Much better resource utilization
        • Improved resource allocation and consumption
        • Support for multi-tenancy
  23. HAWQ
  24. HAWQ: The Crown Jewels of Greenplum
      • SQL compliant
      • World-class query optimizer
      • Interactive query
      • Horizontal scalability
      • Robust data management
      • Common Hadoop formats
      • Deep analytics
  25. HAWQ: High-Performance Query Processing
      • Interactive and true ANSI SQL support
      • Multi-petabyte horizontal scalability
      • Cost-based parallel query optimizer
      • Programmable analytics
  26. HAWQ: Enterprise-Class Database Services & Management
      • Scatter-gather data loading
      • Row and column storage
      • Workload management
      • Multi-level partitioning
      • 3rd-party tool & open client interfaces
  27. HAWQ: Pre-integrated Deep Analytics
      • Performance via fully parallelized implementation
      • Consistent, user-friendly SQL interfaces
      • Ease of data preparation
      • Pre-integrated MADlib support:
        - Linear Regression
        - Logistic Regression
        - Multinomial Logistic Regression
        - K-Means
        - Association Rules
        - PLDA (useful for topic modeling)
  28. GPDB: Components
      [Diagram labels: GPDB; Resource Management; Query Engine; Catalog Service; Planner; Optimizer; Executor; Transaction Manager; GPXF; Local File System]
  29. HAWQ: Components
      [Diagram labels: Resource Management; GPSQL; Query Engine; Planner; Optimizer; Executor; Catalog Service; Transaction Manager; GPXF; HDFS]
  30. How HAWQ Works
      Example query:
        SELECT beer, price
        FROM Bars b, Sells s
        WHERE b.name = s.bar
        AND b.city = 'San Francisco'
      [Diagram labels: Clients (JDBC/ODBC, SQL Console); HAWQ Master Host (Query Parser, Query Optimizer); HDFS Namenode; three HAWQ Segment Hosts, each with a Query Executor and an HDFS Datanode; ...]
  31. How HAWQ Works
      [Same diagram as slide 30, adding: Parse Tree; Optimization Context; Metadata; Cost Model; Resources]
  32. How HAWQ Works
      [Same diagram as slide 30, adding: Execution Plan]
  33. How HAWQ Works
      [Same diagram as slide 30]
  34. How HAWQ Works
      [Same diagram as slide 30, adding: Dynamic Pipelining™ between the HAWQ Segment Hosts]
  35. How HAWQ Works
      [Same diagram as slide 30]
  36. HAWQ Deployment
      • ODBC/JDBC Driver
      • Master Servers & Name Nodes: query planning & dispatch
      • Dynamic Pipelining
      • Segment Servers & Data Nodes: query processing & data storage
      • HDFS
      • External Sources: loading, streaming, etc.
  37. Xtension Framework
      • An advanced version of GPDB external tables
      • Enables combining HAWQ data and Hadoop data in a single query
      • Supports connectors for HDFS, HBase, and Hive
      • Provides an extensible framework API to enable custom connector development for other data sources
  38. HAWQ Benchmarks
      User intelligence    4.2      198     47X
      Sales analysis       8.7      161     19X
      Click analysis       2.0      415    208X
      Data exploration     2.7    1,285    476X
      BI drill down        2.8    1,815    648X
  39. Pivotal Analytics Workbench (AWB)
      Commitment to Accelerating Innovation & Contributing to the Apache Community
      • Multi-million dollar investment by Pivotal and partners in a 1,000-node, 24-petabyte cluster to facilitate innovation and conduct regular integration/scale testing of Apache Hadoop
      • Full-time, dedicated integration onboarding projects and validating each release of Apache Hadoop at scale
      • Contributing back our results and findings to the open source community as well as incorporating them into the continued development of Pivotal HD
  40. “Real” Hadoop Cluster
  41. Leveraging Full Power of the Family
  42. Pivotal Sessions at EMC World
      Session (Presenter): Dates/Times
      • The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications (Josh Klahr): Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
      • Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action (Noelle Sio): Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
      • Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench (Clinton Ooi, Bhavin Modi): Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
      • Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights (SK Krishnamurthy): Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
      • Pivotal: Big & Fast data – merging real-time data and deep analytics (Michael Crutcher): Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
      • Pivotal: Virtualize Big Data to Make The Elephant Dance (June Yang, Dan Baskette): Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
      • Hadoop Design Patterns (Don Miner): Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005
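
The example query on the "How HAWQ Works" slide (slide 30) can be tried locally to see what it returns. The following is a minimal sketch, not HAWQ itself: it uses Python's built-in sqlite3 module as a stand-in engine, and the table schemas (Bars with name and city, Sells with bar, beer, and price) and all sample rows are assumptions made for illustration, since the slides show only the query text. It illustrates the join and filter, not how HAWQ plans the query on the master host or runs it in parallel on segment hosts.

```python
import sqlite3

# Assumption: hypothetical schemas and made-up rows; the slides give only the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Bars  (name TEXT, city TEXT);
    CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL);
    INSERT INTO Bars  VALUES ('Bar A', 'San Francisco'), ('Bar B', 'New York');
    INSERT INTO Sells VALUES ('Bar A', 'Beer X', 5.0),
                             ('Bar A', 'Beer Y', 6.0),
                             ('Bar B', 'Beer Z', 7.0);
""")

# The query from the slide, with straight quotes around the string literal
# (curly quotes from the slide text are not valid SQL string delimiters).
QUERY = """
    SELECT beer, price
    FROM Bars b, Sells s
    WHERE b.name = s.bar
      AND b.city = 'San Francisco'
"""

# Sort in Python because the query has no ORDER BY, so row order is unspecified.
rows = sorted(conn.execute(QUERY).fetchall())
for beer, price in rows:
    print(beer, price)
```

Only beers sold at bars located in San Francisco come back; the row for the hypothetical New York bar is filtered out by the join condition combined with the city predicate.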