Leveraging SAP, Hadoop, and Big Data to Redefine Business

Javier Cuerva | Enterprise Solution Architect | SAP Global CoE
April 16th, 2015 Public

© 2015 SAP SE or an SAP affiliate company. All rights reserved.
More Data, Different Data, Faster Data = Big Data
• Digital universe exploding*
– Until 2020, the digital universe will double every 2 years, reaching 40 Zettabytes
• Data is no longer only relational
• Machine-generated data is the trend "du jour"
– Internet of Things or industrial data
*Source: IDC, The Digital Universe in 2020
1 Zettabyte (ZB) = 1 million Petabytes (PB)

Big Data Economics
Generating significant financial value across sectors
US healthcare
– $300 billion value per year
– ~0.7% annual GDP
Manufacturing
– Up to 50% lower assembly costs
– Up to 7% reduction in working capital
Retail
– 60%+ increase in net margin possible
– 0.5-1.0% annual GDP
Global personal location data
– $100 billion+ revenue for service providers
Source: McKinsey Global Institute Analysis

SAP Focus
End-to-End Value Chain
[Diagram: HANA Data Platform value chain in five stages: SOURCE, INGEST, STORAGE, COMPUTE, CONSUME]
• Sources: Logs, Text, OLTP, Social, Machine, Geo, ERP, Sensor, Store & forward
• Smart Data Integration and Smart Data Quality: transformations & cleansing
• Smart Data Streaming: stream processing
• Smart Data Access: virtual tables, user defined functions
• Dynamic Tiering: aged data on disk
• In-Memory: data model & data
• Calculation engine: fast computing
• Column Storage: high performance analytics
• Series Data Storage: store time-series data
• Engines: spatial processing; analytics, text, graph, predictive; stream processing
• Application development environment
• Hadoop / NoSQL: MapReduce, YARN, HDFS
• Lumira / BI, mobile applications and BI: reporting & dashboards, high performance applications, data exploration & visualization, ad hoc & OLAP analytics, predictive analysis, business planning & forecasting

HANA Data Management
Technical Foundation for End-to-End Big Data
• In-Memory: sub-second response
• Column Storage: high performance analytics
• Dynamic Tiering: warm data to disk
• Smart Data Access: remote source as virtual tables
• Virtual UDF: HDFS and MapReduce
• Smart Data Streaming: on-the-fly stream analysis
• Smart Data Integration: extend HANA with Hadoop stores
• Smart Data Quality: cleansing and transformation
• Replication Server: real-time data movement to Hadoop
• Smart Data Preparation: clean data for better decisions
• Data Services: Big Data and NoSQL transformations
• Data Warehouse Foundation: aging rules and automated data movement from HANA to Hadoop

HANA Data Platform
Big Data Features
HANA native Big Data
• Dynamic Tiering
• Smart Data Streaming
• Graph | Geo | TimeSeries
HANA & Hadoop
• Smart Data Access: Hive | Spark
• MapReduce | HDFS

HANA Data Platform
Dynamic Tiering
HANA Dynamic Tiering
• Native Big Data solution: real-time insights, ALL enterprise data
• Manage data cost effectively
• Terabytes to Petabytes
• Application-defined temperature
• Single database experience
• Centralized operational control
CREATE TABLE "demo"."SalesOrders_WARM" (
ID Integer NOT NULL,
CustomerID Integer NOT NULL,
OrderDate date NOT NULL,
…,
PRIMARY KEY (ID)
) USING EXTENDED STORAGE;
INSERT INTO "demo"."SalesOrders_WARM" VALUES ( … );
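
Because the warm table is an ordinary table from the SQL point of view, application-defined temperature can be expressed with plain SQL. The statements below are a minimal sketch only, assuming a hypothetical in-memory table "demo"."SalesOrders_HOT" with the same column layout and an arbitrary two-year cutoff; neither comes from the original slides, and any restrictions on mixing in-memory and extended tables in one transaction should be checked in the product documentation.

```sql
-- Assumption: "demo"."SalesOrders_HOT" is a hypothetical in-memory table
-- with the same column layout as "demo"."SalesOrders_WARM".

-- Copy orders older than two years (arbitrary cutoff) to extended storage.
INSERT INTO "demo"."SalesOrders_WARM"
  SELECT * FROM "demo"."SalesOrders_HOT"
  WHERE OrderDate < ADD_YEARS(CURRENT_DATE, -2);

-- Remove the copied rows from the in-memory table (run as a separate statement).
DELETE FROM "demo"."SalesOrders_HOT"
  WHERE OrderDate < ADD_YEARS(CURRENT_DATE, -2);

-- Query hot and warm data together as one logical table.
SELECT ID, CustomerID, OrderDate FROM "demo"."SalesOrders_HOT"
UNION ALL
SELECT ID, CustomerID, OrderDate FROM "demo"."SalesOrders_WARM";
```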

HANA Data Platform
Hadoop Integration
HANA & Hadoop Integration
• SQL on Hadoop via SDA with virtual tables
– Hive or Spark
• Execution of MapReduce jobs via virtual functions
• Access to HDFS
• Calculation view support for virtual tables

HANA Data Platform
Hive, Spark Integration
Feature Highlights
• Data virtualization (Smart Data Access) to Hive via ODBC connectivity
• Richer SQL access from SAP HANA studio to Hive tables
• Compute SQL operations between Hive tables and HANA tables from HANA studio
• Remote caching of Hive data in queries, e.g.:
SELECT * FROM HIVE_LINEITEMS WHERE ORDER_ID = 6 WITH HINT (USE_REMOTE_CACHE);
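
As an illustration of SQL operations between Hive tables and HANA tables, the sketch below first exposes a Hive table as a virtual table and then joins it with a local table. HIVE_LINEITEMS is the name used on the slide; the remote source HIVE_SOURCE, the Hive table "default"."lineitems", and the HANA table ORDERS are hypothetical names introduced here, and the remote path components may differ per installation.

```sql
-- Assumptions: HIVE_SOURCE is a hypothetical remote source for a Hive ODBC
-- connection, "default"."lineitems" a hypothetical Hive table, and ORDERS a
-- hypothetical HANA table with an ORDER_ID column.
CREATE VIRTUAL TABLE HIVE_LINEITEMS
  AT "HIVE_SOURCE"."HIVE"."default"."lineitems";

-- Join a local HANA table with the Hive-backed virtual table.
SELECT o.ORDER_ID, COUNT(*) AS LINE_COUNT
FROM ORDERS AS o
JOIN HIVE_LINEITEMS AS l ON l.ORDER_ID = o.ORDER_ID
GROUP BY o.ORDER_ID;
```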

Virtual UDF for HDFS and MapReduce Integration
Architecture

Virtual UDF for HDFS and MapReduce Integration
Syntax
Highlights
• Syntax:
CREATE VIRTUAL FUNCTION <func_name> [(<parameter_clause>)]
RETURNS <return_table_type>
[SQL SECURITY <mode>]
[<package_clause>]
CONFIGURATION <remote_proc_properties>
AT <remote_source_name>;
• Virtual function properties
– Can be used in place of a table or derived table, where the RETURNS clause represents the result set
– Configuration parameters differ depending on whether the call targets HDFS or a MapReduce job
– Points to a remote Hadoop cluster defined by the CREATE REMOTE SOURCE DDL

HANA Data Platform
HDFS Integration
Feature Highlights
• Query native HDFS (Hadoop Distributed File System) data
• Read-only access to HDFS files
• The vUDF needs to define the schema of the returned result set with the TABLE clause
• Some relevant configuration parameters are listed below; more are in the SPS09 Administration Guide
Parameter Name | Description
hdfs_location | Location of the HDFS file, e.g. /user/hive/tpch/products
hdfs_field_delimiter | Character that separates fields in the file pointed to by hdfs_location
datetime_format | ISO datetime format of a date_time column in the file
date_format | ISO date format of a date column in the file, e.g. yyyy-MM-dd
time_format | ISO time format of a time column in the file

HDFS Demo with Virtual UDF
First, create a remote source pointing to the WebHDFS and WebHCat servers
• Use the CREATE REMOTE SOURCE statement for that
Create a virtual user-defined function (vUDF)
• Point it to the HDFS file and specify the type of data returned
Access the HDFS file
• Call the vUDF
HANA Data Platform
MapReduce Integration
Feature Highlights
• Capability to invoke MapReduce jobs from HANA
• End-to-end development:
– Develop the Mapper and Reducer Java classes in HANA studio by creating a Java project in the SAP HANA Development perspective
– Deploy the MapReduce job from HANA studio
• The vUDF needs to define the schema of the returned result set with the TABLE clause
• Some relevant configuration parameters are listed below; more are in the SPS09 Administration Guide
Parameter Name | Description
mapred_mapper | Full Java class name for the map phase
mapred_reducer | Full Java class name for the reduce phase
mapred_input | Initial file to be used by MapReduce, an intermediate result when chaining MapReduce calls, or the input directory to read the data from

MapReduce Demo with Virtual UDF
First, create a remote source pointing to the WebHDFS and WebHCat servers
• Use the CREATE REMOTE SOURCE statement for that
Create a virtual user-defined function (vUDF)
• Reference the Mapper class name
• Reference the Reducer class name
• Reference the input file location the MapReduce job should read from
Call the MapReduce job
• Call the vUDF
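
A word-count job is a common way to picture these steps. The sketch below assumes an existing remote source under the hypothetical name HADOOP_SOURCE; the function name, package, Java class names, and input path are all invented for illustration. The mapred_jobchain wrapper around the three parameters is recalled from SAP documentation rather than taken from the slides, so treat the exact configuration layout as an assumption and verify it against the Administration Guide.

```sql
-- Assumptions: HADOOP_SOURCE is an existing remote source; WORD_COUNT, the
-- package, the com.example classes, and /user/demo/input are hypothetical.
-- The mapred_jobchain layout is not shown in the slides; verify before use.
CREATE VIRTUAL FUNCTION WORD_COUNT()
RETURNS TABLE (WORD NVARCHAR(60), OCCURRENCES INTEGER)
PACKAGE DEMO."demo.jobs::WordCount"
CONFIGURATION 'mapred_jobchain=[{"mapred_input":"/user/demo/input","mapred_mapper":"com.example.WordMapper","mapred_reducer":"com.example.WordReducer"}]'
AT HADOOP_SOURCE;

-- Calling the vUDF triggers the MapReduce job and returns its result set.
SELECT WORD, OCCURRENCES
FROM WORD_COUNT()
ORDER BY OCCURRENCES DESC;
```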

HANA Data Platform and Hadoop
Where we are heading
Some relevant features:
• Lightweight and fast data replication/movement from HANA to Hadoop
• Data aging solution for HANA via a Data Lifecycle Management utility to define aging rules and relocate aged data to Hadoop
• SDA support for Data Provisioning for the SAP HANA Service/Adapter Framework
• SDA performance optimization: maintain statistics
• Optimize SAP HANA and Spark SQL integration
• Leverage HANA/Hadoop security capabilities for user authentication
• Single UI for HANA and Hadoop cluster administration & monitoring (through Ambari)

Conclusion
Bringing Big Data to mainstream Enterprise Data
ONE PLATFORM | ALL WORKLOADS | INTEGRATED | ALL DATA | SIMPLE | OPEN

Backup Slides

HANA Data Platform
[Diagram: HANA Platform architecture]
• Applications: any apps on any app server; SAP Business Suite and BW on the ABAP app server; other apps
• Data types: Location, Real-time, Hadoop, Machine, Unstructured, Transaction
• Languages: SQL, SQLScript, JavaScript
• Services: Spatial, Text Search, Text Analysis & Mining, Stored Procedure & Data Models, Application & UI Services, Business Function Library, Predictive Analysis Library, Database Services, Series Data, Rules Engine
• Integration & Streaming Services
SAP HANA is the platform for ALL applications
A true platform
• Converged OLTP + OLAP
• Native processing services
• Embedded business logic
Supports any application
• 60% of HANA use cases are outside of the SAP landscape
• 1,300+ start-ups & ISVs developing on HANA
Supports any device

SAP HANA Smart Data Integration & Smart Data Quality
Replication, Batch Integration, and Data Virtualization
Capabilities
• Real-time replication & CDC on select sources
• Bulk integration (metadata / data)
• Data virtualization via Smart Data Access
• Real-time data cleansing and transformation
• Data enrichment with geospatial information
• SAP HANA Studio to define data transformation flows
• Support for on-premise and cloud sources
• Open SDK and built-in adapters including Hive
Benefits
• Simplified landscape: 1 environment to provision data
• Real-time: lower latency with in-memory performance
• Open & extensible: supports data of any shape or size
[Diagram labels: Smart Data Integration, Smart Data Quality, SAP HANA, Transformations, Metadata, Adapter Framework, built-in adapters, custom adapters; sources shown: OData, DB2, Oracle, SQL Server]

SAP HANA Smart Data Access
Virtual Table
Capabilities
• Real-time, virtualized data access to external sources
• SAP sources: HANA, ASE, IQ, MaxDB, ESP, SQLA
• Databases: Teradata, Microsoft SQL Server, Oracle, IBM DB2, IBM Netezza
• Hadoop: Hive ODBC driver to Cloudera, Hortonworks, MapR
• NoSQL: Spark
Benefits
• Optimized performance
• Complements existing enterprise investments
• Lower development costs by using data directly from its source system

SAP BusinessObjects BI / SAP Lumira & Hadoop / NoSQL
Combined With SAP HANA
[Diagram: SAP BusinessObjects BI (Data Integration, BI Universe), SAP Lumira Desktop, and SAP Lumira Cloud connected to the SAP HANA Platform and to Hadoop / NoSQL]
Hadoop / NoSQL
• Hive: SQL query
• Impala: MPP SQL query
• MongoDB: document DB
• Cassandra: NoSQL DB
• MapReduce / YARN / AWS Elastic MapReduce: distributed processing framework
Capabilities
• SAP Lumira Desktop & SAP BusinessObjects BI can integrate with Hadoop via SAP HANA
• SAP BusinessObjects BI 4.0 FP 3 (Universe) integrates with Hive, Cloudera Impala, and AWS EMR
• SAP Lumira Desktop integrates with Hive and AWS EMR
• SAP Lumira comes with a Datasource Extension Framework that developers can use to build additional data source access; MongoDB, DataStax, and Spark SQL are the most recent examples
• SAP Lumira Cloud integrates with Hive (0.13), Cloudera Impala (1.21), and AWS EMR
Benefits
• Flexible choice on how to access Hadoop / NoSQL
• Greater insight from Big Data analytics
[Diagram labels: Smart Data Access; SAP Data Services 4.1 (Hive & HDFS)]

SAP Predictive Analytics & Hadoop / NoSQL
SAP Predictive Analytics 2.0
[Diagram: SAP Predictive Analytics 2.0 connected to Hadoop / NoSQL and to Greenplum (SQL DB)]
Capabilities
• Unified UI for business analysts and data scientists
• Extensive predictive library including R algorithms
• Big Data ready, with support for Hive and Spark as well as Greenplum; custom data extensions available for HDFS and virtually any NoSQL database
• Cloud services & SDK ready, with full process automation capabilities
Benefits
• Can be packaged in business applications
• Improved prediction & insights from Big Data analysis
Hadoop / NoSQL
• Hive: SQL
• Spark: in-memory processing
• HDFS: Hadoop Distributed File System
