Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights
 


Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how you will manage it.


After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.
Objective 2: Understand how to set up a large Hadoop deployment and manage its day-to-day challenges.
Objective 3: Have a view of the tools that are necessary to solve the challenges of managing a large Hadoop cluster.

    Presentation Transcript

    • Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights SK Krishnamurthy skrishnamurthy@gopivotal.com © Copyright 2013 EMC Corporation. All rights reserved. 1
    • Traditional Enterprise Analytics Process
    • The Fundamental Paradigm Shift
        • Internet age and exploding data growth
        • Enterprises leverage new data sources to identify emerging trends and opportunities
        • Traditional database tools not able to cope
    • Enter Hadoop: Platform for Big Data
        • Flexible
        • Scalable
        • Inexpensive
        • Fault-tolerant
        • Rapidly Adopted
    • Evolution of Process with Hadoop
    • HDFS Economics Have Changed the Game
        • [Chart: Big Data Platform Price/TB, 2008 to 2013, y-axis up to $80,000; series: Big Data DB, Hadoop]
        • Big Data RDBMS pricing will ultimately converge with Hadoop pricing.
        • The price per TB of Big Data RDBMS has been consistently eroding over time.
        • Hadoop pricing has increased slightly over time as vendors have injected value-added services into the ecosystem.
    • Where We’re Going
    • Big Data Platform
        • Pivotal Data Platform (diagram labels): Stream Ingestion; Streaming Services; Data Staging Platform; Data Mgmt. Services; Operational Intelligence; Run-Time Applications; In-Memory DB; Analytical Query; In-Memory Objects; HDFS
        • Enterprise Data Warehouse (RDBMS): continues to serve as system of record; traditional BI/reporting; data visualization; compliance and financial reporting
    • Flexible Deployment Model
        • Portable; Elastic; HW Abstracted; Manageable; “Consumer” grade deploy
        • Private Cloud; On Premise; Public Cloud
    • PIVOTAL HD: The world’s most powerful Hadoop distribution
    • Pivotal HD
        • World’s first true SQL processing for enterprise-ready Hadoop
        • 100% Apache Hadoop-based platform
        • Virtualization and cloud ready with VMware and Isilon
        • Scale tested in 1,000-node Pivotal Analytics Workbench
        • Available as a software-only or appliance-based solution
        • Backed by EMC’s global, 24x7 support infrastructure
    • Pivotal Hadoop Distributions (100% Open Source Compatible)
        • GPHD: Apache Hadoop 1.x
        • Pivotal HD: Apache Hadoop 2.x
    • Pivotal HD Components
        • HDFS – The Hadoop Distributed File System acts as the storage layer for Hadoop
        • Pig – High-level procedural language for data pipeline/data flow processing in Hadoop
        • MapReduce – Parallel processing framework used for data computation in Hadoop
        • HBase – NoSQL, key-value data store on top of HDFS
        • Hive – Structured data warehouse implementation for data in HDFS that provides a SQL-like interface to Hadoop
        • Mahout – Library of scalable machine-learning algorithms
        • Spring Hadoop – Integrates the Spring framework into Hadoop
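The MapReduce entry above names the processing model without showing it. As a rough illustration only, the sketch below imitates the three stages (map, shuffle, reduce) in plain single-process Python for a word count. It does not use Hadoop or any Pivotal HD API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the input lines are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence (in-process, no cluster)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up input, purely for illustration.
counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; the sketch only shows the data flow between the stages.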
    • Pivotal HD Value-Added Components
        • GPHD Includes…
            • Installation and Configuration Manager (ICM) – cluster installation, upgrade, and expansion tools.
            • GP Command Center – visual interface for cluster health, system metrics, and job monitoring.
            • Hadoop Virtualization Extension (HVE) – enhances Hadoop to support virtual node awareness and enables greater cluster elasticity.
            • GP Data Loader – parallel loading infrastructure that supports “line speed” data loading into HDFS.
        • Pivotal HD Adds the Following to GPHD…
            • Advanced Database Services (HAWQ) – high-performance, “True SQL” query interface running within the Hadoop cluster.
            • Extensions Framework (GPXF) – support for HAWQ interfaces on external data providers (HBase, Avro, etc.).
            • Advanced Analytics Functions (MADlib) – ability to access parallelized machine-learning and data-mining functions at scale.
            • Isilon Integration – extensively tested at scale with guidelines for compute-heavy, storage-heavy, and balanced configurations.
    • Pivotal Core Components & Versions
        Component        GPHD 1.2 Core Distribution    Pivotal HD Enterprise
        Hadoop           1.0.3                         2.0.2
        HBase            0.92.1                        0.94.2
        Hive             0.8.1                         0.9.1
        Mahout           0.6                           0.8.0
        Pig              0.9.2                         0.10.0
        Zookeeper        3.3.5                         3.4.3
        Flume            1.2.0                         1.2.0
        Sqoop            1.4.1                         1.4.1
        Spring Hadoop    (no version listed)           (no version listed)
    • Pivotal HD Architecture
        • Apache (diagram labels): Resource Management & Workflow; Pig, Hive, Mahout; HBase; Map Reduce; Yarn; HDFS; Zookeeper; Sqoop; Flume
    • Pivotal HD Architecture
        • Apache (diagram labels): Resource Management & Workflow; Pig, Hive, Mahout; HBase; Map Reduce; Yarn; HDFS; Zookeeper; Sqoop; Flume
        • Pivotal HD Enterprise (diagram labels): Hadoop Virtualization (HVE); Data Loader; Command Center; Deploy, Configure, Monitor, Manage
    • Pivotal HD Architecture
        • Apache (diagram labels): Resource Management & Workflow; Pig, Hive, Mahout; HBase; Map Reduce; Yarn; HDFS; Zookeeper; Sqoop; Flume
        • Pivotal HD Enterprise (diagram labels): Hadoop Virtualization (HVE); Data Loader; Command Center; Deploy, Configure, Monitor, Manage
        • HAWQ – Advanced Database Services (diagram labels): ANSI SQL + Analytics; Xtension Framework; Query Optimizer; Dynamic Pipelining; Catalog Services
    • DataLoader
        • Inputs (diagram labels): Streams; Files; Pull; Push; Connectors; Flume
        • Interfaces (diagram labels): Web GUI and CLI; REST APIs
        • Functions (diagram labels): Data Source Registration; Job Management; Data Destination Registration; Copy Strategy Optimization; Data Processing; Data Copy
        • Storage endpoints (diagram labels): HDFS; NFS; HTTP; FTP; Local
    • Command Center: Simple and complete cluster management
        • Install and configure Hadoop components and services
        • Centralized interface for Pivotal HD cluster monitoring, diagnostics, and management
        • Live and historical Hadoop system metrics analysis
        • Deploy; Configure; Analyze; Monitor; Manage
    • Command Center – Monitor, Manage, and Analyze
        • Host-, application-, and job-level performance monitoring across the entire Pivotal HD cluster
        • Visualize and analyze live and historical Hadoop cluster information through the Command Center Dashboard
        • Quick diagnostics of functional or performance issues
    • Hadoop Virtualization Extensions (HVE)
        • HVE enables Hadoop to support more effective virtual deployments
        • This creates the opportunity to provision and scale the compute and storage processes independently, resulting in:
            • Much better resource utilization
            • Improved resource allocation and consumption
            • Support for multi-tenancy
    • HAWQ
    • HAWQ: The Crown Jewels of Greenplum
        • SQL compliant
        • World-class query optimizer
        • Interactive query
        • Horizontal scalability
        • Robust data management
        • Common Hadoop formats
        • Deep analytics
    • HAWQ High-Performance Query Processing
        • Interactive and true ANSI SQL support
        • Multi-petabyte horizontal scalability
        • Cost-based parallel query optimizer
        • Programmable analytics
    • HAWQ Enterprise-Class Database Services & Management
        • Scatter-gather data loading
        • Row and column storage
        • Workload management
        • Multi-level partitioning
        • 3rd-party tool & open client interfaces
    • HAWQ Pre-integrated Deep Analytics
        • Performance via fully parallelized implementation
        • Consistent, user-friendly SQL interfaces
        • Ease of data preparation
        • Pre-integrated MADlib support
            – Linear Regression
            – Logistic Regression
            – Multinomial Logistic Regression
            – K-Means
            – Association Rules
            – PLDA (useful for topic modeling)
    • GPDB – Components
        • GPDB
        • Resource Management
        • Query Engine: Planner, Optimizer, Executor
        • Catalog Service
        • Transaction Manager
        • GPXF
        • Local File System
    • HAWQ – Components
        • GPSQL
        • Resource Management
        • Query Engine: Planner, Optimizer, Executor
        • Catalog Service
        • Transaction Manager
        • GPXF
        • HDFS
    • How HAWQ Works
        • Clients (JDBC/ODBC, SQL Console) submit a query: SELECT beer, price FROM Bars b, Sells s WHERE b.name = s.bar AND b.city = 'San Francisco'
        • HAWQ Master Host: Query Parser, Query Optimizer; HDFS Namenode
        • HAWQ Segment Hosts (three shown, then "..."): each pairs a Query Executor with an HDFS Datanode
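To make the example query on this slide concrete, here is a minimal sketch that runs it against an in-memory SQLite database using Python's standard `sqlite3` module. This is not HAWQ: SQLite stands in only so the join can be executed locally. The table layouts (`Bars(name, city)`, `Sells(bar, beer, price)`) are inferred from the column names in the query, the rows are invented, and an `ORDER BY` was added solely to make the output order deterministic.

```python
import sqlite3

# In-memory stand-in database; schema inferred from the slide's query,
# sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Bars  (name TEXT, city TEXT);
    CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL);
    INSERT INTO Bars VALUES
        ('Anchor Room', 'San Francisco'),
        ('Lakeside',    'Chicago');
    INSERT INTO Sells VALUES
        ('Anchor Room', 'Pale Ale', 6.5),
        ('Anchor Room', 'Stout',    7.0),
        ('Lakeside',    'Lager',    5.0);
    """
)

# The query from the slide, plus ORDER BY for a stable result order.
QUERY = """
    SELECT beer, price
    FROM Bars b, Sells s
    WHERE b.name = s.bar AND b.city = 'San Francisco'
    ORDER BY beer
"""

rows = conn.execute(QUERY).fetchall()
for beer, price in rows:
    print(beer, price)
```

The join keeps only beers sold at bars located in San Francisco, so the Chicago bar's row is filtered out.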
    • How HAWQ Works (diagram repeated; added labels: Parse Tree, Optimization Context, Metadata, Cost Model, Resources)
    • How HAWQ Works (diagram repeated; added label: Execution Plan)
    • How HAWQ Works (diagram repeated; no added labels)
    • How HAWQ Works (diagram repeated; added label: Dynamic Pipelining™)
    • How HAWQ Works (diagram repeated; no added labels)
    • HAWQ Deployment
        • ODBC/JDBC Driver
        • Master Servers & Name Nodes: query planning & dispatch
        • Dynamic Pipelining
        • Segment Servers & Data Nodes: query processing & data storage (HDFS)
        • External Sources: loading, streaming, etc.
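The deployment slide splits work between master servers (planning and dispatch) and segment servers (processing next to the data). The toy sketch below mimics that division in one Python process: a "master" function sends the same filter to every "segment" partition in parallel and merges the partial results. It is an assumption-laden illustration, not HAWQ's actual mechanism; `execute_on_segment`, `dispatch_and_gather`, the row shape and the data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Row = Tuple[str, float]  # hypothetical (beer, price) row


def execute_on_segment(partition: Sequence[Row],
                       predicate: Callable[[Row], bool]) -> List[Row]:
    """Segment-side work: scan only the partition stored on this segment."""
    return [row for row in partition if predicate(row)]


def dispatch_and_gather(partitions: Sequence[Sequence[Row]],
                        predicate: Callable[[Row], bool]) -> List[Row]:
    """Master-side work: send the same filter to every segment, merge results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns partial results in partition order.
        partials = pool.map(
            lambda part: execute_on_segment(part, predicate), partitions
        )
        return [row for partial in partials for row in partial]


# Three made-up partitions, one per pretend segment host.
segments = [
    [("Pale Ale", 6.5), ("Lager", 5.0)],
    [("Stout", 7.0)],
    [("Pilsner", 4.5), ("Porter", 7.5)],
]

result = dispatch_and_gather(segments, lambda row: row[1] >= 6.0)
print(result)
```

Threads are used here only to suggest parallel execution; in the deployment described on the slide, each segment would be a separate server reading from its local HDFS data node.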
    • Xtension Framework
        • An advanced version of GPDB external tables
        • Enables combining HAWQ data and Hadoop data in a single query
        • Supports connectors for HDFS, HBase and Hive
        • Provides an extensible framework API to enable custom connector development for other data sources
    • HAWQ Benchmarks
        User intelligence    4.2      198    47X
        Sales analysis       8.7      161    19X
        Click analysis       2.0      415    208X
        Data exploration     2.7    1,285    476X
        BI drill down        2.8    1,815    648X
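The transcript does not preserve the column headers for the benchmark figures. Assuming the first two numbers in each row are timings in the same unit (the faster system first) and the third is the reported speedup, the sketch below checks that the third column is simply the second divided by the first, rounded half up. The dictionary name and the helper `round_half_up` are illustrative; the numbers are copied from the slide.

```python
import math

# (first value, second value, reported speedup) per workload, from the slide.
# Interpretation as (fast time, slow time, speedup) is an assumption.
benchmarks = {
    "User intelligence": (4.2, 198, 47),
    "Sales analysis": (8.7, 161, 19),
    "Click analysis": (2.0, 415, 208),
    "Data exploration": (2.7, 1285, 476),
    "BI drill down": (2.8, 1815, 648),
}


def round_half_up(value: float) -> int:
    """Round to the nearest integer, with .5 going up."""
    return math.floor(value + 0.5)


computed = {
    name: round_half_up(slow / fast)
    for name, (fast, slow, _reported) in benchmarks.items()
}

for name, (_fast, _slow, reported) in benchmarks.items():
    print(f"{name}: computed {computed[name]}X, reported {reported}X")
```

Under that reading, every reported speedup matches the ratio of the two preceding values.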
    • Pivotal Analytics Workbench (AWB): Commitment to Accelerating Innovation & Contributing to the Apache Community
        • Multi-million dollar investment by Pivotal and partners in a 1,000-node, 24-petabyte cluster to facilitate innovation and conduct regular integration/scale testing of Apache Hadoop
        • Full-time, dedicated integration onboarding projects and validating each release of Apache Hadoop at scale
        • Contributing back our results and findings to the open source community as well as incorporating them into the continued development of Pivotal HD
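A quick back-of-the-envelope calculation puts the cluster size in perspective. The slide gives only the node count and total capacity; the sketch assumes decimal units (1 PB = 1,000 TB), that the 24 PB figure is raw capacity, and that data is stored with HDFS's default replication factor of 3. None of those assumptions is confirmed by the slides.

```python
NODES = 1_000          # from the slide
TOTAL_PB = 24          # from the slide
TB_PER_PB = 1_000      # assumption: decimal units
REPLICATION = 3        # assumption: HDFS default replication factor

# Average capacity per node, if storage is spread evenly.
per_node_tb = TOTAL_PB * TB_PER_PB / NODES

# Unique (non-replicated) data that fits if every block is stored 3 times.
unique_data_pb = TOTAL_PB / REPLICATION

print(f"{per_node_tb:.0f} TB per node, about {unique_data_pb:.0f} PB of unique data")
```

With these assumptions each node holds about 24 TB, and the cluster stores roughly 8 PB of unique data once replication is accounted for.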
    • “Real” Hadoop Cluster
    • Leveraging Full Power of the Family
    • Pivotal Sessions at EMC World (Session | Presenter | Dates/Times)
        • The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
        • Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
        • Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
        • Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
        • Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
        • Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
        • Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005