The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for its global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA, Sybase and SAP BusinessObjects, enabling a broad range of new analytic applications.
3. Big Data Platform
A collection of Hadoop and Apache solutions running together as an integrated platform.
Open source: Apache Software Foundation.
Works across component technologies and integrates with pre-existing EDW, RDBMS and MPP systems.
Runs on Linux and Windows.
Authentication, authorization, and data protection.
Native integration with major BI/analytics tools and vendors.
HDP Platform overview (diagram): Hortonworks Data Platform (HDP). Data management: YARN (processing) on HDFS (storage). Data access: Pig (script/ETL), MapReduce (process), Hive (SQL-like), HBase (online), Spark (in-memory). Real-time ingest: Flume and Storm. Batch integration: Sqoop.
4. Big Data Platform
Scalable: stores and distributes very large data sets across hundreds of servers operating in parallel, with clusters of thousands of nodes holding thousands of terabytes of data.
Cost effective: the savings are staggering, offering computing and storage capability for hundreds of dollars per terabyte.
Flexible: can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection.
Fast: able to efficiently process terabytes of data in just minutes, and petabytes in hours.
Resilient to failure: data in each individual node is also replicated to other nodes in the cluster, so processing can continue in the event of a node failure.
Hadoop Technology Advantages & Profile
Source – External: in almost all cases from outside the corporation (social networks or suppliers).
Size – Big: normally tens to hundreds of terabytes, up to petabyte scale.
Structure – Not structured: data is not separated into columns/rows and has no schema.
5. Data Management
Stores data across several clusters and servers
Architecture: NameNode (metadata) and DataNodes (data blocks)
Large volume: 200 PB of storage and a single cluster of
4500 servers, supporting close to a billion files and blocks;
Minimal data motion: Hadoop moves compute processes to the data on HDFS, not the other way around; moving computation is cheaper than moving data;
Dynamic diagnosis: diagnoses the health of the file system and rebalances the data across different nodes;
Rollback: Allows operators to bring back the previous
version of HDFS after an upgrade;
Node redundancy: Supports high availability (HA);
Storage: Hadoop Distributed File System (HDFS)
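For concreteness, a minimal sketch of working with HDFS from Python through the standard `hdfs dfs` command line; paths are hypothetical, and it assumes a configured Hadoop client on PATH:

```python
import subprocess

# Copy a local file into HDFS (hypothetical paths).
subprocess.run(["hdfs", "dfs", "-put", "sales.csv", "/data/raw/sales.csv"],
               check=True)

# Raise the file's replication factor to 3 and wait for it to apply;
# this is the node-redundancy mechanism described above.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", "/data/raw/sales.csv"],
               check=True)

# Report file-system health: block placement, under-replicated blocks.
subprocess.run(["hdfs", "fsck", "/data/raw", "-files", "-blocks"],
               check=True)
```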
6. Data Management
Manages cluster compute resources and job scheduling on top of HDFS
Multi-tenancy: multiple access engines, batch, interactive and real-time, can simultaneously access the same data set, making Hadoop their common standard.
Cluster utilization: dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions.
Scalability: data center processing power continues to expand rapidly; ResourceManager scheduling lets clusters expand to thousands of nodes managing petabytes of data.
Processing: Hadoop YARN
7. Data Access
Processes data
The Map function: the InputFormat divides input data into splits (parts), and a map task is created for each split in the input.
The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reducer.
The Reduce function: collects the various results and combines them to answer the larger problem that the master node needs to solve.
Each reducer collects the data from all of the maps for its keys and combines them to solve the problem.
Batch: MapReduce (unstructured data analysis)
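To make the map/reduce flow concrete, here is a minimal word-count sketch using Hadoop Streaming, which allows the two functions to be written in Python; file names and paths are hypothetical:

```python
#!/usr/bin/env python
# mapper.py -- emits one (word, 1) pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so per-word counts can be
# summed in a single pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current and current is not None:
        print("%s\t%d" % (current, count))  # flush previous word
        count = 0
    current = word
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))      # flush last word
```

The job would be launched with the streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar name and path vary by distribution).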
8. Data Integration & Governance
High-volume data ingestion
Stream data: ingests streaming data from multiple sources into Hadoop for storage and analysis.
Guaranteed data delivery: channel-based transactions guarantee reliable message delivery.
When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. This ensures guaranteed-delivery semantics.
Scales horizontally: to ingest new data streams and additional volume (see the configuration sketch below).
Real-time ingest: Apache Flume (diagram: unstructured data flows from agent nodes through collector nodes into the HDFS storage area).
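As a sketch of the source/channel/sink wiring described above, the following writes a minimal, hypothetical Flume agent configuration (a netcat source feeding an HDFS sink through a channel, which provides the transactional guarantees) and starts the agent; names, ports and paths are assumptions:

```python
import subprocess

# Hypothetical single-agent configuration.
FLUME_CONF = """
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /data/raw/events
agent1.sinks.sink1.channel = ch1
"""

with open("agent1.conf", "w") as f:
    f.write(FLUME_CONF)

# Start the agent (assumes flume-ng is on PATH; a --conf directory with
# flume-env.sh may also be supplied).
subprocess.run([
    "flume-ng", "agent",
    "--name", "agent1",
    "--conf-file", "agent1.conf",
], check=True)
```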
9. Data Integration & Governance
Very-large-volume data ingestion
Fast: benchmarked at processing a million-plus messages/records per second per node.
Scalable: parallel calculations run across a cluster of machines.
Fault-tolerant: when workers die, Storm automatically restarts them; if a node dies, the worker is restarted on another node.
Reliable: Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
Real-time ingest: Apache Storm
10. Data Integration & Governance
Connects Hadoop to traditional RDBMSs
Data imports: moves data from external stores and EDWs into Hadoop to optimize the cost-effectiveness of combined data storage and processing.
Improvements: compression and indexing for query performance.
Parallel data transfer: for faster performance and optimal system utilization.
Fast data copies: from external systems into Hadoop.
Load balancing: mitigates excessive storage and processing loads on other systems.
Efficient data analysis: improves the efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake (see the import sketch below).
Batch integration: Apache Sqoop
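A hedged sketch of a parallel Sqoop import from a relational database into HDFS; the connection string, credentials, table and paths are hypothetical, and it assumes the `sqoop` CLI plus a JDBC driver for the source database:

```python
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # source RDBMS (hypothetical)
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",         # keep the password off the command line
    "--table", "orders",                        # table to import
    "--target-dir", "/data/warehouse/orders",   # destination directory in HDFS
    "--num-mappers", "4",                       # parallel transfer, as noted above
    "--compress",                               # compressed storage in HDFS
], check=True)
```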
11. Data Access
Easy programming language
Easily programmed: complex tasks involving interrelated data transformations can be simplified and encoded as data-flow sequences. Pig programs accomplish huge tasks, yet they are easy to write and maintain.
Iterative data processing: extract-transform-load (ETL) data pipelines and research tools on raw data.
Extensible: Pig users can create custom functions to meet their particular processing requirements.
Self-optimizing: because the system automatically optimizes execution of Pig jobs, the user can focus on semantics.
Script/ETL: Apache Pig
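As an illustration of such a data-flow sequence, a small, hypothetical Pig script run from Python (input and output paths are assumptions; the `pig` CLI must be on PATH):

```python
import subprocess

# Count page hits per user from a tab-separated web log already in HDFS.
PIG_SCRIPT = """
logs    = LOAD '/data/raw/weblogs' USING PigStorage('\\t')
          AS (user:chararray, url:chararray);
grouped = GROUP logs BY user;
hits    = FOREACH grouped GENERATE group AS user, COUNT(logs) AS n;
STORE hits INTO '/data/out/user_hits';
"""

with open("user_hits.pig", "w") as f:
    f.write(PIG_SCRIPT)

# Run the script on the cluster.
subprocess.run(["pig", "-f", "user_hits.pig"], check=True)
```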
12. Data Access
SQL-like tool
Familiar: query data with a SQL-based language. Hive tables are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units.
Fast: interactive response times, even over huge datasets.
Partitioned: each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory (see the sketch below).
Scalable and extensible: as data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance.
Uses JobTracker (MapReduce) functionality for query execution.
SQL-like: Apache Hive
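A sketch of a partitioned Hive table queried from Python, assuming a running HiveServer2 and the PyHive client; host, table and columns are hypothetical:

```python
from pyhive import hive  # assumes the PyHive package is installed

conn = hive.Connection(host="hive-server", port=10000)  # hypothetical host
cur = conn.cursor()

# A partitioned table: each sale_date value becomes a sub-directory of
# the table directory, as described above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id BIGINT,
        store    STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Partition pruning: only the 2016-01-01 sub-directory is scanned.
cur.execute("SELECT store, SUM(amount) FROM sales "
            "WHERE sale_date = '2016-01-01' GROUP BY store")
for store, total in cur.fetchall():
    print(store, total)
```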
13. Data Access
NoSQL tool with a SQL-like command interface
Apache HBase is an open-source NoSQL database that provides real-time read/write access to large datasets.
Scales linearly to handle huge data sets with billions of rows and millions of columns.
Easily combines data sources that use a wide variety of different structures and schemas.
Natively integrated with Hadoop, and works seamlessly alongside other data access engines through YARN.
A natural choice for storing semi-structured data such as log data.
Online: Apache HBase
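A minimal read/write sketch against HBase from Python, assuming the HBase Thrift server is running and the happybase client is installed; the table, row key and columns are hypothetical:

```python
import happybase

conn = happybase.Connection("hbase-thrift-host")  # hypothetical host

# Rows are keyed byte strings; columns live in column families, so rows
# in the same table can carry different column sets (flexible schema).
table = conn.table("weblogs")
table.put(b"user42|2016-01-01T12:00:00", {
    b"req:url":    b"/checkout",
    b"req:status": b"200",
})

# Real-time read of a single row by key.
print(table.row(b"user42|2016-01-01T12:00:00"))
```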
14. Data Access
Brings fast, in-memory data processing to Hadoop.
Elegant and expressive development APIs in Scala, Java, R, and Python.
Allows data workers to efficiently execute streaming, machine-learning or SQL workloads that need fast iterative access to datasets.
Designed for data science; its abstractions make data science easier.
Data scientists commonly use machine learning, a set of techniques and algorithms that can learn from data; these algorithms are often iterative, which is where in-memory processing pays off (see the sketch below).
In-memory: Apache Spark
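A short PySpark sketch of the in-memory, iterative pattern; it assumes Spark 2.x, and the dataset path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Caching keeps the dataset in memory, which is what makes iterative
# (e.g. machine-learning) workloads fast.
events = spark.read.json("hdfs:///data/raw/events").cache()

# Every pass over `events` after the first is served from memory.
events.groupBy("user").count().show()
print(events.filter(events.status == 500).count())
```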
16. ERP/DW-BI Platform
Fast in-memory database
Traditional DBMS features: SQL interface, transactional isolation and recovery (ACID).
Parallel data-flow model: calculations can be executed in parallel and distributed across hosts.
Latest-generation data storage:
Columnar and row-based tables;
Near-elimination of indexes;
High data compression.
Automatic recovery: from memory errors, without a system reboot.
Native tools: Predictive Analysis Library, plus analytical and special-purpose interfaces.
SAP/HANA architecture
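A sketch of the column-store basics from Python, assuming SAP's hdbcli client is installed; the connection details, credentials and table are hypothetical:

```python
from hdbcli import dbapi

# Hypothetical host, instance port and credentials.
conn = dbapi.connect(address="hana-host", port=30015,
                     user="DEMO", password="ChangeMe1")
cur = conn.cursor()

# Columnar storage is the default choice for analytics: high compression
# and near-elimination of indexes follow from the column-store layout
# described above.
cur.execute("""
    CREATE COLUMN TABLE sales (
        order_id BIGINT,
        store    NVARCHAR(40),
        amount   DECIMAL(15, 2)
    )
""")
cur.execute("SELECT store, SUM(amount) FROM sales GROUP BY store")
print(cur.fetchall())
```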
17. ERP/DW-BI Platform
100x faster
Optimization: InfoCubes and DataStore Objects (DSO) with better performance.
Faster remodeling: improved and lean data models; simplified data modeling and reduced materialized layers.
Data marts: integrated and embedded flexibility; OLAP and OLTP can also be executed in one system.
Increased flexibility: optimized Layered Scalable Architecture; aggregates and cubes are no longer required (they become optional).
Improved response times: for existing transactions and entire business processes, through the general performance improvement of the underlying HANA database.
SAP/Hana Technology Advantages & Profile
Size – Big: not considered big for the Web 2.0 era; tens of terabytes, not reaching petabytes.
Structure – Structured: separated into columns/rows and with a schema.
Source – Internal: in almost all cases from inside the corporation (ERP/CRM/SCM).
18. SAP/Hana Evolution
Starting point: the SAP landscape consists of SAP ERP running on a relational database, connected to an OLAP engine (e.g. SAP BI) and perhaps using a Business Intelligence Accelerator like BOBJ.
(Diagram, ERP/DW-BI platform: OLTP on SAP/ECC → ETL → OLAP on SAP/BW → analytics in SAP/BOBJ.)
Introducing HANA in parallel: install and run the in-memory engine (HANA) TOGETHER with the traditional SAP instances.
Two BW extractors run at the same time, exporting the same data.
Key factor: real data-processing performance COMPARISON.
(Diagram: the original OLTP SAP/ECC → ETL → OLAP SAP/BW → SAP/BOBJ analytics chain, plus a second ETL feeding SAP/HANA → SAP/BOBJ analytics in parallel.)
19. SAP/Hana Evolution
BW database upgrade: re-creates traditional-style BI in memory.
(Diagram, ERP/DW-BI platform: OLTP SAP/ECC → ETL → OLAP SAP/BW running on SAP/HANA → analytics in SAP/BOBJ.)
ERP/BI full database upgrade: eliminates the traditional database and runs both instances in-memory, using non-materialized views.
(Diagram: OLTP SAP/ECC and OLAP BI 2.0 both running on SAP/HANA, with analytics in SAP/BOBJ.)
20. ERP/DW-BI Platform
Sizing on SAP/Hana
• Memory
• Traditional sizing is CPU-driven; SAP HANA sizing is memory-driven (CPU performance <> SAP HANA memory).
• Master/transactional data volume determines main memory.
• Main memory is required for:
• Storing the business data;
• Temporary memory space;
• Supporting complex queries;
• Buffers & caches.
• CPU
• Behaves differently with SAP HANA compared to traditional databases.
• Queries: complex, and at maximum speed.
• Disk size
• Disk storage space is still required:
• To preserve database information if the system shuts down (either intentionally or due to a power loss);
• Data changes are periodically copied to disk (ensuring a full image of the business data on disk);
• A logging mechanism enables system recovery.
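A rough, illustrative calculation of the memory points above, assuming a column-store compression factor and the common rule of thumb that about half of main memory is reserved for temporary/query work space; these factors are assumptions, not official SAP sizing figures:

```python
# Illustrative HANA memory sizing sketch (not SAP's official formula).
source_data_tb = 10.0   # uncompressed business data (assumption)
compression    = 5.0    # column-store compression factor (assumption)

compressed_tb = source_data_tb / compression
ram_tb = compressed_tb * 2   # data + work space, buffers and caches

print("Compressed data: %.1f TB" % compressed_tb)  # 2.0 TB
print("Main memory:     %.1f TB" % ram_tb)         # 4.0 TB
```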
21. ERP/DW-BI Platform
SAP/Hana on VM
• SAP HANA on vSphere is fully supported.
• Combining SAP HANA and vSphere provides additional benefits with regard to deployment and availability.
• Some customer slots remain for SAP-on-SAP-HANA controlled-availability proofs of concept.
• SAP HANA monitoring via the Blue Medora plug-in:
• Monitor memory and vCPU utilization;
• Add/delete resources;
• Underutilized: deploy more SAP HANA;
• Overutilized: release SAP HANA resources;
• Workload management;
• Determine consolidation ratios.
22. • Amazon AWS: SAP partner
• SAP BW on HANA trial - PoC
• The AWS server provides a HANA instance, ready to go in 30 minutes
• OLAP: BW on HANA, or any other data warehouse application with predominantly OLAP workloads, including data marts running a lot of complex queries
• OLTP: any transactional application, like Business Suite on HANA, predominantly running simple queries or CRUD operations
ERP/DW-BI Platform
SAP/Hana on Cloud
23. • Storage replication:
• The storage itself replicates all data to another location, within one or between several data centers. The technology is hardware-vendor-specific, and multiple concepts are available on the market.
• System replication:
• SAP HANA replicates all data to another location, within one or between several data centers. The technology is independent of hardware-vendor concepts and reusable with a changing infrastructure.
ERP/DW-BI Platform
Disaster Recovery
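As a heavily hedged sketch of setting up HANA system replication: the `hdbnsutil` tool enables the primary site and registers a secondary. Flag names vary between HANA revisions, and the site names, host and instance number here are hypothetical:

```python
import subprocess

# On the primary host: enable system replication for this site.
subprocess.run(["hdbnsutil", "-sr_enable", "--name=SiteA"], check=True)

# On the secondary host (with HANA stopped): register against the
# primary in synchronous replication mode.
subprocess.run([
    "hdbnsutil", "-sr_register",
    "--name=SiteB",
    "--remoteHost=primary-host",
    "--remoteInstance=00",
    "--replicationMode=sync",
], check=True)
```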
24. Host Auto-Failover
• Standby mode:
• No data, requests or queries.
• When an active (worker) host fails, a standby host automatically takes its place.
• Since the standby host can take over operations from any of the primary hosts, it needs access to all of the database volumes.
• Once repaired:
• The failed host can be rejoined to the system as the new standby host, re-establishing the failure-recovery capability.
ERP/DW-BI Platform
SAP/HANA High Availability
26. ERP/DW-BI & Big Data Platform
Architecture proposal (diagram): unstructured data flows into the Hortonworks Data Platform: real-time ingest via Flume and Storm, batch integration via Sqoop; data management with YARN (processing) on HDFS (storage); data access via Pig (script/ETL), MapReduce (process), Hive (SQL-like), HBase (online) and Spark (in-memory). Structured data from ERP/CRM/SCM lands in HANA data repositories, whose OLAP, predictive and spatial engines feed the application logic, rendering and analytics.
27. ERP/DW-BI & Big Data Platform
Business Case: CRM/Retail
Internal structured data sources
• Point-of-sale data – data captured when the customer makes purchases, either in-store or on the company's e-commerce site (4 TB)
• Inventory and stock information – which products are in stock at which locations/promotions (7 TB)
• CRM data – from all the interactions the customer has had with the company at the support site (8 TB)
• Total data size: 19 TB
External unstructured data sources
• Social media data – sentiment analysis of the customer's social media, such as Facebook (70 TB)
• Historical Web log information – a record of the customer's past browsing behavior on the company's Web site (30 TB)
• Geographic customer behavior – origin/destination of potential customers near stores (20 TB)
• Total data size: 120 TB
28. ERP/DW-BI & Big Data Platform
Business Case: Data Process