Webinar turbo charging_data_science_hawq_on_hdp_final

© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Turbocharging Your Data Science with
HAWQ on the Hortonworks Data Platform
We Do Hadoop

Your Hosts
Michael Cucchi
•  Sr. Director of Outbound Product for Pivotal's Data,
Mobile, and IoT solutions
•  20 years of engineering, management, and
marketing experience in the high-tech industry
@mikecucchi
Matt Morgan
•  Vice President, Global Product Marketing
•  20-year history as a marketing and product
executive in cloud, SaaS, and big data businesses
@forwardtension

Establish Hadoop as the
Foundational Technology
of the Modern Enterprise
Data Architecture
Hortonworks Quick Facts
Year Founded: In 2011, 24 engineers from the original Hadoop team at Yahoo! spun out to form Hortonworks.
Ticker Symbol: NASDAQ: HDP
Headquarters: Santa Clara, CA
Business Model: Open source software support subscriptions, training, and consulting services
Non-GAAP Billings: Grew from zero to over $120 million on an annualized basis in 11 quarters
Subscription Customers: 437 in 11 quarters, with 105 added in Q1 2015 alone
Support: 24×7 global web and telephone support
Partners: 1100 joint engineering, strategic reseller, technology, and system integrator partners
Employees: 650+
Global Operations: 17 countries
#1: 28 out of 86 Apache Hadoop committers. Hortonworks employs the largest group of Hadoop committers under one roof; more than twice any other company.
#1: 165 Apache committer seats for projects in HDP. Our committers work in 20+ projects on the data access, management, security, operations, and governance needs of the enterprise; more than twice any other company.
The Forrester Wave™ Big Data Hadoop Solutions: We are recognized as a leader in Hadoop by Forrester Research based on the strengths of our offerings and strategy.

Traditional Systems Under Pressure
Challenges
•  Constrains data to app
•  Can’t manage new data
•  Costly to Scale
[Chart: business value of new vs. traditional data; industry leaders vs. laggards]
New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs
Traditional data: ERP, CRM, SCM
Data volume: 2.8 zettabytes in 2012, 40 zettabytes projected for 2020
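The data-volume figures on this slide imply a compound growth rate that is easy to work out. The sketch below is only arithmetic on the two quoted numbers (2.8 ZB in 2012, 40 ZB in 2020); the function name is illustrative and not part of the webinar.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted on the slide: 2.8 zettabytes in 2012, 40 zettabytes in 2020.
growth = implied_annual_growth(2.8, 40.0, 2020 - 2012)
print(f"Implied growth: {growth:.1%} per year")
```

Under these figures the data volume grows by roughly 39% per year, which is the pressure the slide says traditional systems struggle to absorb.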

Early Hadoop: The Start of a Modern Data Architecture
Apache Hadoop is an open source data platform for
managing large volumes of high velocity and variety of data
•  Built by Yahoo! to be the heartbeat of its ad & search business
•  Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
•  Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
✓  Manages new data paradigm
✓  Handles data at scale
✓  Cost effective
✓  Open source
Traditional Hadoop Had Limitations
Batch-only architecture with limited analytic
options
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
[Diagram: a single application running MapReduce batch processing on HDFS storage]

Today: Modern Data Architecture Unifies Data & Processing
Modern Data Architecture
•  Enable applications to have access to
all your enterprise data through an
efficient centralized platform
•  Supported with a centralized
approach to governance, security, and
operations
•  Versatile to handle any applications
and datasets no matter the size or
type
[Diagram: Modern Data Architecture]
Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured
Data systems: HDFS (Hadoop Distributed File System) and YARN: Data Operating System, running batch, interactive, real-time, and partner ISV workloads alongside EDW and MPP
Analytics and applications: data marts, business analytics, visualization & dashboards
Supporting layers: operational tools, dev & data tools, infrastructure

Partnerships Enrich the Hadoop Ecosystem
[Diagram: sources (existing systems such as ERP, CRM, SCM; clickstream; web & social; geolocation; sensor & machine; server logs; unstructured) flow into HDFS and on to analytics: data marts, business analytics, visualization & dashboards]
Deep Partnerships
Hortonworks engages
in deep engineered relationships
with the leaders in the data center,
such as EMC, Microsoft, Teradata,
Red Hat, HP, SAS & SAP
Broad Partnerships
Over 1100 partners work with us to
certify their applications to work with
Hadoop so they can extend big data
to their users
[Diagram, continued: YARN: Data Operating System runs batch, interactive, real-time, and partner ISV workloads alongside the EDW]

Hadoop Adoption Follows a Predictable Journey
From cost optimization to new analytic apps, and ultimately to a data lake

Hadoop Driver: Cost optimization
Archive Data off EDW
Move rarely used data to Hadoop as active
archive, store more data longer
Offload costly ETL process
Free your EDW to perform high-value functions
like analytics & operations, not ETL
Enrich the value of your EDW
Use Hadoop to refine new data sources, such as
web and machine data for new analytical context
HDP helps you reduce costs and optimize the value associated with your EDW
[Diagram: sources (existing systems such as ERP, CRM, SCM; clickstream; web & social; geolocation; sensor & machine; server logs; unstructured) feed HDP 2.2, which handles ELT and holds cold data, deeper archive, and new sources. Hot data stays in the enterprise data warehouse (MPP, in-memory), which serves data marts, business analytics, and visualization & dashboards.]

Hadoop Driver: Advanced analytic applications
Single View:
Improve acquisition & retention
•  HDP enables a single view of each
customer, allowing organizations to
provide targeted, personalized
customer experiences.
•  Single view reduces attrition,
improves cross-sell and improves
customer satisfaction.
Predictive Analytics:
Identify next best action
•  HDP captures, stores and processes
large volumes of data streaming
from connected devices
•  Stream processing and data science
help introduce new analytics for real-
time and batch analysis
Data Discovery:
Uncover new findings
•  HDP allows exploration of new data
types and large data sets that were
previously too big to capture, store &
process.
•  Unlock insights from data such as
clickstream, geo-location, sensor,
server log, social, text and video
data.

360° Customer View Boosts Sales at Home Supply Retailer
Problem: Lack of unified customer record across all channels
clouded targeting for marketing campaigns
•  No “golden record” for analytics on customer buying behavior across all channels
•  Data repositories on web traffic, POS transactions and in-home services existed in
isolation of each other
•  Data storage costs were increasing, without a corresponding increase in value
Solution: HDP data lake drives golden customer record, targeted
marketing, and reduction in data storage expenses
•  Golden record enables targeted, personalized marketing with higher success rates
•  Data warehouse offload saved millions of dollars in recurring expense
•  Price optimization versus competitors → several millions in top-line revenue growth
New Analytic Applications
Clickstream, Unstructured
and Structured Data
Retail
Major home improvement
retailer
RT2
Why Hadoop?
Single View

Responsive Patient Treatment with Real-time Monitoring of Vitals
Problem: Inability to store and access sufficient data for medical
decision support in real time
•  9 million patient records on a legacy system were neither searchable nor retrievable
•  Cohort selection for research projects was slow, despite abundance of data
•  Clinicians had minimal access to historical data gathered across all patients
Solution: Unified data lake improves patient health, speeds
research
•  Legacy system retired immediately, saving $500K in annual recurring expense
•  Records stored with patient identification for clinical use, same data presented
anonymously to researchers for cohort selection
•  Wireless patches transmit vital signs, algorithms notify doctors of high risk patterns
•  Heart patients weigh themselves from home, algorithms notify doctors about unsafe
weight changes and recommend a visit to the clinic
New Analytic Applications
Sensor, Social Data
& ETL Offload
Healthcare
Public university teaching
hospital
HC2
Why Hadoop?
Predictive Analytics

Hadoop Driver: Enabling the Data Lake
[Chart axes: scale vs. scope]
Data Lake Definition
•  Centralized Architecture
Multiple applications on a shared data
set with consistent levels of service
•  Any App, Any Data
Multiple applications accessing all data
affording new insights and opportunities.
•  Unlocks ‘Systems of Insight’
Advanced algorithms and applications
used to derive new value and optimize
existing value.
Drivers:
1.  Cost Optimization
2.  Advanced Analytic Apps
Goal:
•  Centralized Architecture
•  Data-driven Business
DATA LAKE
Journey to the Data Lake with Hadoop
Systems of Insight

Case Study: 12-Month Hadoop Evolution at TrueCar
12-month execution plan (chart axis: data platform capabilities)
•  June 2013: Begin Hadoop execution
•  July 2013: Hortonworks partnership
•  Aug 2013: Training & dev begins
•  Nov 2013: Production cluster, 60 nodes, 2 PB
•  Dec 2013: Three production apps (3 total)
•  Jan 2014: 40% of dev staff proficient
•  Feb 2014: Three more production apps (6 total)
•  May 2014: IPO
12-Month Results at TrueCar
•  Six production Hadoop applications
•  Sixty nodes / 2 PB of data
•  Storage and compute costs fell
from $19/GB to $0.12/GB
“We addressed our data platform capabilities
strategically as a precursor to IPO.”
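To put the quoted per-gigabyte figures in perspective, here is a minimal sketch of the arithmetic. It uses only the two numbers on the slide ($19/GB before, $0.12/GB after); the function name is illustrative.

```python
def cost_reduction(before_per_gb: float, after_per_gb: float) -> tuple[float, float]:
    """Return (how many times cheaper, percent saved) for a per-GB cost change."""
    factor = before_per_gb / after_per_gb
    percent_saved = (1 - after_per_gb / before_per_gb) * 100
    return factor, percent_saved


# Figures quoted in the TrueCar case study.
factor, percent_saved = cost_reduction(19.00, 0.12)
print(f"{factor:.0f}x cheaper, {percent_saved:.1f}% lower cost per GB")
```

On these numbers the cost per gigabyte dropped by a factor of about 158, a reduction of more than 99%.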

Hortonworks Data Platform
Hadoop for the Enterprise

HDP Makes Hadoop Enterprise-Ready
Hortonworks Data Platform
Multi-tenant data platform built on a centralized
architecture of shared enterprise services
YARN: data operating system
Governance Security
Operations
Resource management
Existing
applications
New
analytics
Partner
applications
Data access: batch, interactive, real-time
Storage
Key benefits
Consolidates all data sets
Delivers real-time insights
Integrates with data center
Scalable and affordable

HDP Makes Hadoop Pervasive
Any application: Batch, interactive, and real-time
Any data: Existing and new datasets
Anywhere: Complete range of deployment options (commodity, appliance, cloud)
[Diagram: existing applications, new analytics, and partner applications over data access: batch, interactive, real-time]

An “Any Application” Example: Spark in HDP
Delivering a production-ready
experience for Spark applications
•  Centralized Resource Management
Integrated with YARN
•  Consistent Operations
Provisioned and managed by Ambari
•  Comprehensive Security
Runs within secure clusters
•  Deployable Anywhere
Windows, Linux, on-premises or cloud;
consistent Cloudbreak launch experience
Governance Security
Operations
Resource management
Storage

An “Anywhere” Example: Cloudbreak and HDP
Cloudbreak
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!
Example Ambari Blueprints:
•  IoT Apps (Storm, HBase, Hive)
•  BI / Analytics (Hive)
•  Data Science (Spark)
•  Dev / Test (all HDP services)

“Hortonworks loves and lives
open source innovation”
World Class Support and Services.
Hortonworks' Customer Support received a
maximum score and was significantly higher
than both Cloudera and MapR
A Leader in Hadoop
The Forrester Wave™
Big Data Hadoop Solutions
Q1 2014

Pivotal in the Modern Data Architecture
[Diagram: Pivotal in the Modern Data Architecture]
Sources: OLTP, ERP, CRM systems; documents & emails; web logs, click streams; social networks; machine generated; sensor data; geo-location data
Data systems: repositories (RDBMS, EDW, MPP); governance & integration, security, operations, data access, and data management on YARN, with Greenplum, GemFire, and HAWQ
Applications: statistical analysis; BI / reporting, ad hoc analysis; interactive web & mobile applications; enterprise applications
Operations tools: provision, manage & monitor
Dev & data tools: build & test
Infrastructure: on premise, cloud, appliance

© Copyright 2014 Pivotal. All rights reserved.
Turbo Charging Data
Science with HAWQ

© 2015 Pivotal Software, Inc. All rights reserved.
Pivotal By the Numbers
FOUNDED APRIL 2013
1700+ EMPLOYEES
FUNDED BY EMC, VMWARE, AND GE
HUNDREDS OF CUSTOMERS
PIVOTAL DATA
>$100M in data software bookings in 2014
PIVOTAL CLOUD FOUNDRY
Fastest revenue growth in an open source project in history
>$40M in first year for Pivotal Cloud Foundry in 2014 (subscription)
BIG DATA
CLOUD PLATFORM
AGILE

Software is Eating the World
Data Is Fueling Software

The Data Divide
BIG DATA CHASM
•  70% of data generated by customers
•  80% of data stored
•  3% prepared for analysis
•  0.5% being analyzed
•  <0.5% being operationalized

Pivotal Business Data Lake Architecture
[Diagram: Pivotal Business Data Lake tiers]
Sources
Ingestion tier: real-time ingestion, micro batch ingestion, batch ingestion
System monitoring, system management, workflow management
Processing tier: in-memory, MPP database; real-time, micro batch, mega batch
Distillation tier
HDFS storage: unstructured and structured data
Query interfaces: SQL, NoSQL, MapReduce
Insights tier: real-time insights, interactive insights, batch insights
Action tier

The Data Driven Enterprise Journey
STORE
•  Structured
•  Unstructured
•  High Volume
•  High Velocity
ANALYZE
•  Predictive Analytics
•  Machine Learning
•  Advanced Data Science
•  Realtime Analytics
DEVELOP
•  Advanced Analytic Pipelines
•  Realtime Analytical Applications
•  Global Scale Data-Driven
Applications
•  Enterprise, Consumer, IoT, and
Mobile
INNOVATE
•  Agile Dev Expertise
•  DevOps
•  Hybrid Cloud
•  Continuous Delivery
•  Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA
PREDICTIVE ANALYTICS
ENTERPRISE PAAS

Technical Observations
•  SQL is, and will remain, the most valuable workload on Hadoop
•  While Hadoop continues to mature, focused MPP SQL will remain
important
•  Scale-out in-memory processing will see significant enterprise
adoption and impact in the future
•  Streaming and Machine Learning will continue to gain value
•  Open Source is becoming critical to enterprise investment decisions

Pivotal BDS + Hortonworks HDP = The Complete Solution
Pivotal Data Engineering, Pivotal Data Science, Pivotal Labs
HDP

SQL on Hadoop Ecosystem / HAWQ
Challenges | Requirements
Complex joins not supported | Complex joins at performance
Advanced analytics support | Advanced analytics at scale within SQL
Interactive query latency issues | Fast interactive queries on large data
Ad-hoc query performance issues | Strong ad-hoc query support in optimizer
SQL analytic query coverage issues | Full analytic SQL compliance
Concurrent query throughput issues | High query throughput for mixed workloads

HAWQ: Enterprise Class SQL on Hadoop
•  Leverages market-leading Greenplum technology
•  100% ANSI SQL compliant for analytic workloads
•  Advanced cost-based query optimizer
•  Highest performing SQL on Hadoop
•  Polymorphic storage with advanced compression
•  Industry-differentiating data federation with PXF*
•  Built-in advanced analytics for data science (MADlib)
•  Supports all major HDFS file formats (Avro, Parquet, HDFS)
•  Integrated with leading analytical tools out-of-the-box
*PXF = Pivotal eXtension Framework

Business Benefits
Feature | Benefit
Rich and compliant SQL dialect | Powerful and portable SQL apps; leverage large SQL-based ecosystems
TPC-DS compliance | Enable a wide range of use cases; avoid surprises in production
Flexible/efficient joins at linear scale | Off-load EDW workloads at a much lower cost
Deep analytics + machine learning | Predictive/advanced learning use cases at scale
Data federation capabilities | Build use cases with diverse/external data assets without data movement
High availability and fault tolerance | Off-load business critical workloads from EDW
Native Hadoop file format support | Reduce ETL and data movement = lower costs

Pivotal Query Optimizer (PQO)
For HAWQ and Greenplum Database
Turns a SQL query into an execution plan
•  Leading cost-based optimizer for big data
•  Applies all possible optimizations at the same time
   –  Considers many more plan alternatives
   –  Optimizes a wider range of queries
   –  Optimizes memory usage
•  New extensible code base
   –  Rapid adoption of emerging technologies
PIVOTAL VALUE-ADDED FUNCTIONALITY
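The idea of "considering many plan alternatives" and picking the cheapest can be illustrated with a toy example. This is not the Pivotal Query Optimizer's algorithm or cost model; it is a minimal sketch of generic cost-based join ordering, and the table names, row counts, and selectivities are all hypothetical.

```python
from itertools import permutations

# Hypothetical table sizes (rows) and join selectivities; not taken from the webinar.
ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 20}
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "regions"}): 1 / 20,
}


def plan_cost(order):
    """Cost of a left-deep join order: total rows across all intermediate results."""
    joined = {order[0]}
    rows = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Apply the selectivity of every join predicate linking the new table
        # to the tables already joined; no predicate means a cross product.
        sel = 1.0
        for other in joined:
            sel *= SELECTIVITY.get(frozenset({table, other}), 1.0)
        rows = rows * ROWS[table] * sel
        cost += rows
        joined.add(table)
    return cost


def best_plan(tables):
    """Enumerate every join order and keep the one with the lowest estimated cost."""
    return min(permutations(tables), key=plan_cost)


plan = best_plan(ROWS)
print(plan, plan_cost(plan))
```

With these made-up statistics, the cheapest orders join the two small tables first and bring in the large orders table last, while any order that joins orders directly with regions pays for a cross product. A production optimizer explores a far larger space (join algorithms, data motion, memory) than this exhaustive toy search.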

Configuring and Managing HAWQ with Ambari
•  Install HAWQ/PXF Ambari plugin RPM
•  Restart Ambari
•  Add HAWQ/PXF service like any other Hadoop component

Pivotal eXtension Framework (PXF)
•  Enables connectivity between HAWQ and
other services (Hive, HBase).
•  Provides an extensible framework to add
support for custom services
•  Operates as a separate service in Hadoop
Industry differentiators
•  Low latency on large data sets
•  Extensible and customizable
•  Considers cost model of federated sources
[Diagram: HAWQ reaches HDFS, Hive, and HBase services through PXF]
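As a rough illustration of what data federation through PXF looks like from the SQL side, the sketch below builds a CREATE EXTERNAL TABLE statement of the general shape used for PXF-backed tables. It is an assumption-laden example, not taken from the webinar: the host, port, HDFS path, table and column names are hypothetical, and the profile name and port should be checked against the PXF documentation for your HAWQ version.

```python
def pxf_external_table_ddl(table, columns, host, port, path, profile, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement that reads delimited text through PXF.

    Illustration only: values are interpolated directly, so never feed this
    function untrusted input.
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    location = f"pxf://{host}:{port}/{path.lstrip('/')}?PROFILE={profile}"
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_list})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


ddl = pxf_external_table_ddl(
    table="ext_clickstream",                 # hypothetical table name
    columns=[("user_id", "int"), ("url", "text"), ("ts", "timestamp")],
    host="namenode.example.com",             # hypothetical host
    port=51200,                              # assumed PXF port; check your cluster
    path="/data/clickstream",                # hypothetical HDFS path
    profile="HdfsTextSimple",                # assumed profile for delimited text
)
print(ddl)
```

Once such a table exists, it can be queried and joined like any other table, which is how HAWQ avoids moving the underlying data.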

Data Driven Journey with Pivotal Big Data Suite
STORE
•  Structured
•  Unstructured
•  High Volume
•  High Velocity
ANALYZE
•  Predictive Analytics
•  Machine Learning
•  Advanced Data Science
•  Realtime Analytics
DEVELOP
•  Advanced Analytic Pipelines
•  Realtime Analytical Applications
•  Global Scale Data-Driven
Applications
•  Enterprise, Consumer, IoT, and
Mobile
INNOVATE
•  Agile Dev Expertise
•  DevOps
•  Hybrid Cloud
•  Continuous Delivery
•  Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA
PREDICTIVE ANALYTICS
ENTERPRISE PAAS
Products: Spring XD, Spark, Pivotal HD & Open Data Platform, Pivotal Greenplum Database, Pivotal HAWQ, Pivotal GemFire, Redis, RabbitMQ, Spring IO, Groovy, Pivotal BDS on PCF, Pivotal Cloud Foundry
Services: Pivotal Labs, Data Science, Data Engineering

Putting it All Together
DATA FEEDS TRANSACTIONAL APPS ANALYTIC APPS
Expert Systems &
Machine Learning
Advanced
Analytics
Real-Time
Data
Data Stream Pipeline
HDFS Data Lake
Distributed
Computing

Putting it All Together
DATA FEEDS TRANSACTIONAL APPS ANALYTIC APPS
GemFire
Spring XD: Ingest → Filter → Enrich → Sink
HAWQ, GPDB
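The ingest, filter, enrich, sink stages named on this slide can be sketched as a chain of generators. This is a conceptual analogy in plain Python, not Spring XD's API or DSL; the record format, stage function names, and sample data are hypothetical.

```python
def ingest(lines):
    """Parse raw 'user,page' lines into event dictionaries."""
    for line in lines:
        user, _, page = line.partition(",")
        yield {"user": user.strip(), "page": page.strip()}


def keep_valid(events):
    """Filter stage: drop events that are missing a user or a page."""
    for event in events:
        if event["user"] and event["page"]:
            yield event


def enrich(events, segments):
    """Enrich stage: attach a customer segment looked up from reference data."""
    for event in events:
        yield {**event, "segment": segments.get(event["user"], "unknown")}


def sink(events):
    """Sink stage: collect events; a real pipeline would write to a data store."""
    return list(events)


raw = ["alice,/home", "bob,/cart", ",/broken", "carol,/checkout"]
segments = {"alice": "loyal", "bob": "new"}
stored = sink(enrich(keep_valid(ingest(raw)), segments))
print(stored)
```

Each stage consumes the previous stage's output lazily, so records stream through one at a time. In the architecture on the slide, the sink would be a store such as GemFire, HAWQ, or GPDB rather than an in-memory list.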

Demo: HAWQ on HDP
bit.ly/HAWQonHDPVideo
Tutorial: HAWQ on Sandbox
bit.ly/HAWQonHDPTutorial

Introducing The Open Data Platform Initiative

A shared industry effort to help promote and advance
the state of Apache Hadoop® and Big Data
technologies for the Enterprise

The Open Data Platform will accelerate the delivery of
Big Data solutions by providing a well-defined
platform called ‘The ODP Core’

The ODP Core
▪  The ODP Core is the kernel over which the industry can
build enterprise-class Apache Hadoop® solutions
–  Simplifying development of interoperable technologies
▪  Created by the ODP Developer Community
–  A team of cross industry technical experts
–  Individual, or member company developers – anyone can participate
▪  Using an open and transparent planning and release
process that follows the Apache Way
–  Interoperability within and beyond the ODP Core drives a broad set of use cases
and rapid market growth

Delivering Enterprise Requirements & Real-world Experience
ODP Member Companies
•  Diverse representation of the Big Data ecosystem
–  End users, ISVs, Systems Integrators, Distribution vendors, etc.
–  Any company can join the Open Data Platform
•  A forum for the Enterprise to define its Big Data
requirements
–  Industry groups (SIGs) to align on common industry practices and
challenges
•  Direct feedback and participation in the ODP Core
–  Real world experience determining what is Enterprise grade

A Simple Beginning For The ODP Core
▪  The ODP Core is starting with a small number of projects
–  Enables a rapid start for the Initiative and an industry driven definition
▪  All members decide how the ODP Core evolves
–  All members are responsible for choosing projects to include in the ODP Core
–  Platinum, Gold and Silver member companies = One Member / One Vote
ODP Core Initial Projects: HDFS, YARN, MapReduce, Ambari
✓  Deployable Hadoop configuration
✓  Improves interoperability
✓  Gives customers more freedom
✓  Follows the Apache Way

Quickly Showing Value To The Industry
Hortonworks, IBM, Pivotal and Infosys Harmonize on Open Data Platform Vision to Accelerate Big Data Solutions
Common core: Apache Hadoop 2.6, Apache Ambari
Built on the common core: HDP 2.2, Open Platform 4.0 with Apache Hadoop, IIP, Pivotal HD 3.0
Key benefits
•  Improves ecosystem interoperability
•  Unlocks customer choice
•  Eliminates wasteful guesswork
•  Respects the Apache way

How You Can Participate
•  Anybody can join the ODP. Company memberships start at $1k
•  Have a direct voice into the future of big data
•  Help us define priorities to solve your challenges
•  Join your peers and accelerate industry solutions
•  Contribute people, tests, and code to accelerate executing on the vision
ODP: enabling Big Data solutions to flourish atop a common core platform

Questions?
