Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform

Published in: Technology, Business


Transcript

  • 1. Deep Dive On Pivotal HD - World Class HDFS Platform. Michael Goddard. © Copyright 2014 EMC Corporation. All rights reserved.
  • 2. Agenda
      • Pivotal
      • Pivotal Business Data Lake
      • Introducing Pivotal HD 2.0
      • Pivotal HD 2.0 and Isilon Update
      • Customer Success
      • Q&A
  • 3. What Matters: Apps. Data. Analytics.
      • Apps power businesses, and those apps generate data.
      • Analytic insights from that data drive new app functionality, which in turn drives new data.
      • The faster you can move around that cycle, the faster you learn, innovate and pull away from the competition.
  • 4. How Pivotal Gets You There
      • Uniquely positioned to help enterprises modernize each facet of this cycle today
      • Comprehensive portfolio of products & services spanning Big Data, PaaS & Agile
      • Converging these technologies into a coherent, next-gen Enterprise PaaS platform
      • Diagram labels: Pivotal Labs, Agile Development, Pivotal Data Fabric, Pivotal One, PaaS
  • 5. Pivotal’s Big Bets for the Future
      1. HDFS becomes the data substrate for the next generation of data infrastructures
      2. A set of integrated, consumer-grade services must evolve on top of HDFS: stream ingestion, analytical processing, and transactional serving
      3. Provisioning flexibility and elasticity become critical capabilities for this data infrastructure
  • 6. Pivotal Business Data Lake
      • Govern where it matters
          – Focus on MDM and RDM
          – Enforce only when sharing
          – Treat corporate as aggregation of local
      • Encourage local requirements
          – Let the business decide what they need
          – Build from the bottom
          – Enable traceability to source
          – Disposable data views
      • Distill on demand
          – Select only what you want
          – Business friendly tooling
          – Re-usable information maps
          – Rapid change cycle
      • Store everything
          – Store everything ‘as is’
          – Include structured and unstructured data
          – Store it cheaply
  • 7. Pivotal Business Data Lake Architecture (diagram)
      • Centralized Management: System monitoring, System management
      • Unified Data Management Tier: Data mgmt. services, MDM, RDM, Audit and policy mgmt.
      • Processing Tier: Workflow Management, In-memory, MPP database
      • Sources: Existing Sources, New Data Sources, Unified Sources
      • Ingestion: Real-time ingestion, Micro batch ingestion, Batch ingestion
      • Flexible Actions: Real-time insights, Interactive insights, Batch insights
      • HDFS
  • 8. Pivotal Business Data Lake Architecture (diagram, with products)
      • Tiers: Centralized Management, Unified Data Management Tier, Processing Tier, Unified Sources, Flexible Actions, Existing Sources, New Data Sources
      • Product labels: Pivotal HD Command Center, Data Dispatch, MDM, RDM, Spring XD, Pivotal GemFire XD, HAWQ, Pivotal GemFire, Pivotal RabbitMQ, Redis, Pivotal CF
      • Data source labels: Clickstream, Sensor Data, Weblogs, Network Data, CRM Data, ERP Data
  • 9. How is a Business Data Lake Different? (comparison by criteria: Business Data Lake vs. EDW)
      • Common data model
          – Business Data Lake: Base class = standard data; derived classes = local data
          – EDW: Single class = single view across the enterprise
      • Data quality
          – Business Data Lake: Full spectrum
          – EDW: (graphic of binary digits)
      • Data integration
          – Business Data Lake: Multiple interfaces (SQL, SAS, R, MapReduce, NoSQL)
          – EDW: SQL access; integration with SAS, R and other analytical interfaces
      • Mixed workload with varying QoS
          – Business Data Lake: Support low latency, interactive and batch
          – EDW: Limited QoS separation required
  • 10. Introducing Pivotal HD 2.0
      • Foundation for Business Data Lake
      • World’s Most Advanced Real-Time Analytics Platform
      • Most Extensive Set of Advanced Analytical Toolsets
  • 11. Pivotal HD Architecture (diagram)
      • Apache: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, YARN, ZooKeeper, Oozie
      • Resource Management & Workflow
      • Pivotal Command Center: Configure, Deploy, Monitor, Manage
      • HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics
      • Pivotal GemFire XD – Real-Time Database Services: Distributed In-memory Store, Query, Transactions, Ingestion, Processing, Hadoop Driver – Parallel with Compaction, ANSI SQL + In-Memory
      • Other labels: Pivotal HD Enterprise, Spring XD, Spring, MADlib Algorithms, Virtual Extensions, GraphLab, Open MPI
  • 12. New Apache Hadoop Features in Pivotal HD 2.0
      • Apache Hadoop 2.2 enables enterprise operationalization features such as NFS and Snapshots
      • Hive 0.12 is faster, scales better, and supports a broader set of SQL data types
      • Pig 0.12 (incl. PiggyBank) increases productivity and appeal for a broader set of users
      • HBase 0.96 improves mean time to recovery and adds modularization for easier upgrades and reduced dependencies
  • 13. Hadoop at the Center: Enabling the Data-Driven Enterprise (diagram)
      • Hadoop as a Service: Big Data On-Demand
      • GemFire XD: In-Memory Real-time Analytics
      • Spring XD: Building Big Data Apps
      • Open Source Algorithm Libraries
      • Chorus: Big Data Collaboration
      • Fastest SQL Query Engine
  • 14. Real-Time Analytics
      • Adds fast data ingest, real-time event processing and query performance, enabling SQL users to rapidly analyze and react to high volumes of events on HDFS
      • Enables the creation of low-latency, scale-out OLTP applications integrated out of the box with a big data store
      • Creates a single platform for Analytics and OLTP, removing the need for an ETL process
      • Supports changes to database tables while still complying with the immutability of HDFS
  • 15. Real-Time Data Services on Pivotal HD (diagram)
      • Components: Pivotal GemFire XD, HAWQ, Pivotal Extension Framework, Command Center, Pivotal HD Enterprise, HDFS
      • Applications and inputs: Online Apps, Analytic Apps, Sensor Data / Feeds
      • Flow labels: Model Refresh, Re-evaluate Model, Shared Data, Native Persistence, MapReduce I/P & O/P Formatter
  • 16. Pivotal GemFire XD 1.0 Major Features
      • Enterprise real-time data processing platform for SLA-critical applications; enables users to rapidly and reliably analyze and react to high volumes of events while leveraging 10s of TBs of in-memory reference data.
      • Feature areas: Cloud Scale Real-Time Platform, Seamless Pivotal HD Integration, Optimized for Real-Time Analytics
          – Very low & predictable latencies at high and variable loads
          – 10s of TBs in-memory (MemScale)
          – Multi-tiered caching
          – Real-time event processing
          – Rolling upgrade support
          – SQL-based queries
          – Support structured data
          – Java stored procedures
          – Deep Spring Data integration
          – Scale to HDFS with policy driven in-memory data retention
          – Online and offline querying of HDFS data
          – ETL-less bi-directional integration with other Pivotal HD services
          – Pivotal Extension Framework Integration
          – ICM Integration
      • Enterprise-Class Reliability
          – Distributed transactions (JTA)
          – HA through in-memory redundancy
          – Active-active deployments across WAN
          – JMX based scalable management
          – Visual monitoring through Pulse
  • 17. Deep Scalable Analytics
      • User Defined Functions: PL/R, PL/Java, PL/Python enable writing UDFs in additional languages that execute inside the database, improving performance
      • Parquet columnar open storage format delivers significant performance and scalability improvements
      • Richer set of open source machine learning algorithms helps conduct rapid data science experiments on relational data
  • 18. Deep Scalable Analytics
      • Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
  • 19. Deep Scalable Analytics (HAWQ 1.2)
      • Linear Regression
      • Logistic Regression
      • Multinomial Logistic Regression
      • K-Means
      • Association Rules
      • Latent Dirichlet Allocation
      • Naïve Bayes
      • Elastic Net Regression
      • Decision Trees / Random Forest
      • Support Vector Machines
      • Cox Proportional Hazards Regression
      • Descriptive Statistics
      • ARIMA
  • 20. PivotalR vs. PL/R
      • PivotalR
          – Interface is R client
          – Execution is in database
          – Parallelism handled by PivotalR
          – Supports a portion of R
      • PL/R
          – Interface is SQL client
          – Execution is in R
          – Parallelism via SQL function invocation
          – Supports all of R
  • 21. HAWQ: SQL on Hadoop, Format Agnostic (diagram)
      • Labels: Pivotal HD: HDFS Data Lake; Future formats …; ANSI SQL
  • 22. HAWQ Continues to Soar
      • NameNode High Availability (HA) support improves availability of query processing with full Hadoop fault tolerance
      • Error Table helps to debug data errors
      • Parquet file format: columnar data storage for HDFS
      • HAWQ expansion increases performance (concurrency/throughput) by expanding query processing to newly added data nodes
  • 23. NameNode HA Support
      • Feature:
          – Automatic failover to secondary NameNode when primary fails
      • Benefits:
          – Fully fault tolerant to NameNode failures
          – Improved availability of query processing
          – Integrated into Hadoop availability model
  • 24. Error Table
      • Feature:
          – System table for storing non-conforming data
      • Benefits:
          – Eliminates erroneous data load
          – Reduces retries during load
          – Helps to debug errors in data structures
  • 25. Parquet
      • Features:
          – Open storage format
          – Hybrid row/column open storage format
          – Configurable Parquet or AO/CO format support
          – Compression types: Snappy and Gzip
          – Additional data type support
          – Parquet Input Format Reader API
      • Benefits:
          – Delivers significant performance and scalability improvements
          – Industry standard compression saves storage
          – Usable in MapReduce/Hive workloads
  • 26. HAWQ Expansion
      • Features:
          – Expand HAWQ nodes to additional DataNodes
          – Expand # of segments per HAWQ segment host
      • Benefits:
          – Expand query processing
          – Increase performance by utilizing maximum CPU/resources
          – Increased concurrency/throughput
  • 27. Big Computing and Graph Analytics
      • Open MPI is one of the most mature parallel computing frameworks now available within HDFS, eliminating costly data movement and shortening data science cycles
      • GraphLab is a graph-based library of machine learning algorithms, allowing Data Scientists and Analysts to leverage popular algorithms such as PageRank, collaborative filtering and computer vision in HDFS
  • 28. Background
      • Hadoop MapReduce is not a good fit for iterative applications (like graph computing, machine learning, etc.)
      • Users need to build separate systems/clusters to support those applications
      • MPI is one of the most mature and widely used parallel computing frameworks
          – MPI = Big Computing, Hadoop = Big Data
          – MPI + Hadoop = Big Computing + Big Data
  • 29. MPI Background
      • What is MPI?
          – “a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on wide variety of parallel computers” (Wikipedia)
      • What is Open MPI?
          – One of the most popular implementations of MPI, community supported
      • What is Hamster?
          – “Hadoop And Mpi on the same cluSTER”
  • 30. GraphLab
      • Topic Modeling contains applications like LDA, which can be used to cluster documents and extract topical representations.
      • Graph Analytics contains applications like PageRank and triangle counting, which can be applied to general graphs to estimate community structure.
      • Clustering contains standard data clustering tools such as k-means.
      • Collaborative Filtering contains a collection of applications used to make predictions about users’ interests and factorize large matrices.
      • Graphical Models contains tools for reasoning about structured noisy data.
      • Computer Vision contains a collection of tools for reasoning about images.
  • 31. Pivotal HD: Built for Data Science (diagram)
      • Data Science on Pivotal HD: Relational Advanced Analytics, Graph Advanced Analytics
      • Languages: SQL, R, Python, Java
      • Custom Analytic Functions - UDFs
  • 32. Product and services overview (diagram)
      • World’s Leading Experts: Pivotal Labs – Pivotal Data Labs
      • On Demand Services: Pivotal Data Dispatch
      • Workload labels: BATCH, INTERACTIVE, REAL-TIME
      • Product labels: HAWQ, Greenplum DB, Unlimited Pivotal HD, GemFire XD, GemFire | SQLFire
  • 33. Pivotal Enables Hadoop Market Adoption
      • Data Lakes: Unify Unstructured and Structured Data Access
      • Big Data Apps: Build analytic and transaction-led applications impacting top line revenue
      • Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture
  • 34. Pivotal HD 2.0 and Isilon Update
      • Isilon aligns with our Enterprise Grade Message
      • Pivotal Command Center 2.2 (part of Pivotal HD 2.0)
          – Works with Pivotal HD 1.1.1
          – ‘Down’ status of HDFS is removed when Isilon is configured
      • Isilon has accelerated their integration from Q4 to Q3 for HDFS 2.2
  • 35. Large Mid-Market Financier Builds Foundation to Store All Data of Interest, Convert Insights to Value-added Services
      • Challenge:
          – Mid-market financier seeks to maintain high margins through value-added services
          – Realized that critical insights could come from many sources, but much was deleted due to storage cost
          – Frustrated by the inability to blend a data fabric, build analytics on top of it, and create applications on top of those
      • Solution:
          – Data Lake provides accessibility of any information of interest through a familiar SQL-like interface
          – Provides a foundation for creating Analytics and Applications as value-added services: forecast demand based on social media sentiment, analytics on fleet vehicle usage
  • 36. Major TV Network Replaces Teradata with Pivotal, Builds Infrastructure to Capture $40 Million in Untapped Revenue
      • Challenge:
          – Ad inventory is an inherently perishable product, and is subject to an inefficient, “traditional” selling process
          – Upward trend in volume and traffic due to higher ad quality and mobile devices
          – Inability to react: 7-hour lag time in communication between ad fulfillment and sales teams, exacerbated by major broadcast events
      • Solution:
          – Reduced 7-hour lag time to under 1 hour, enabling network sales to communicate delivered impressions, forecast spend inventory and sell more effectively
          – Maximized profit by selling across brands/channels, allowing the network to better leverage non-premium inventory
  • 37. Home Appliance Maker Lays Foundation for “Smart” Connected Devices, Big Data-based Decisions
      • Challenge:
          – Prepare for next generation appliances: “smart” connected devices, controlled by mobile phone
          – Siloed environment including Teradata, SAS, HP made it difficult to derive true insights across disparate data
      • Solution:
          – Enable innovation and improve service performance through appliances that provide feedback based on output and environmental factors
          – Improve marketing efficiency with targeted campaigns based on market demographics and buying indicators
          – Better understand requirements for parts inventory based on the lifecycle of current appliances
  • 38. National Healthcare Organization Replaces Aging IBM Platform, Seeds Data Lake as Hadoop Beachhead
      • Challenge:
          – Aging IBM Infrastructure could not support new SAS Access and Visual Analytics Technology
          – Interest in enabling infrastructure to support for-profit healthcare analytics as a service business
          – Sought to provide refined data sets to other insurance companies for their own research, needed way to cleanse data
      • Solution:
          – Stepwise evolution of platform onto GPDB, one of two certified platform partners for running visual analytics
          – Established data lake as platform for upload, cleansing and conversion of private data into publicly consumable datasets
  • 39. Aviation: Predictive Maintenance
      • Challenge:
          – An airplane’s comprehensive “gate to gate” flight data didn’t exist in a single place for reporting
          – Each individual flight can generate approximately 1 TB of data, which is economically infeasible to store in a traditional EDW
          – To maintain profitability of GE Aviation's Contract Service Agreements, new analytical methods and approaches were required
      • Solution:
          – Ingest all data to a data lake for data discovery and model development, to increase wing time and aircraft uptime and to improve customer satisfaction and airline profitability
          – Improved capacity for preventative maintenance rather than remediation, reducing expense and liability
      • Pivotal Solution includes: GPDB, PHD, Alpine, Chorus
  • 40. Brazilian Telco Provider Establishes Foundation for Data-Driven Culture
      • Challenge:
          – Poor call quality caused massive loss of customers, with no insight into the root cause of issues
          – Increased scrutiny from regulators, but infrastructure did not support the requests for information needed
          – Difficulties with scale: Call Data Record generates 2 billion new records per day, no info on dropped calls due to capacity
      • Solution:
          – New Data Warehouse infrastructure contains both dropped and completed calls for analysis, with 3-month capacity
          – Hadoop infrastructure with familiar SQL interface stores 5x the volume at half the cost of Teradata
          – Reports which took 2 months to obtain now take 1 day
      • Pivotal Solution includes: PHD, HAWQ
  • 41. Pivotal HD 2.0 Summary
      • The Foundation for Business Data Lake
      • The World’s Most Advanced Hadoop Stack
          – Pivotal HD now based on Apache Hadoop 2.2
          – Real-time SQL, in-memory over Pivotal HD and integrated into Spring: Pivotal GemFire XD
          – Enhanced Interactive SQL over Pivotal HD: HAWQ
      • World’s Most Advanced Big Data Analytic Platform
          – Most extensive set of machine learning libraries: MADlib, R and GraphLab
  • 42. Pivotal HD 2.0 demo: Pivotal Booth
  • 43. Thank You