The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for its global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA, Sybase and SAP BusinessObjects, enabling a broad range of new analytic applications.
3. Big Data Platform
A collection of Hadoop and Apache solutions running together as an integrated platform.
Open source: Apache Software Foundation.
Works across component technologies and integrates with pre-existing EDW, RDBMS and MPP systems.
Runs on Linux and Windows.
Authentication, authorization, and data protection.
Native integration with major BI/analytics tools and vendors.
HDP Platform overview (diagram): Hortonworks Data Platform (HDP). Data management: YARN (processing) on HDFS (storage). Data access: Pig (script/ETL), MapReduce (process), Hive (SQL-like), HBase (online), Spark (in-memory). Real-time ingest: Flume and Storm. Batch integration: Sqoop.
4. Big Data Platform
Scalable: stores and distributes very large data sets across hundreds of servers operating in parallel, with clusters of thousands of nodes holding thousands of terabytes of data.
Cost effective: the savings are staggering, offering computing and storage capability for hundreds of dollars per terabyte.
Flexible: can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection.
Fast: able to efficiently process terabytes of data in just minutes, and petabytes in hours.
Resilient to failure: data in each individual node is also replicated to other nodes in the cluster, so processing can continue in the event of a node failure.
Hadoop Technology Advantages & Profile
Source – External: in almost all cases from outside the corporation (social networks or suppliers).
Size – Big: normally tens to hundreds of terabytes, up to petabyte scale.
Structure – Not structured: data is not separated into columns/rows and has no schema.
5. Data Management
Stores data across several clusters and servers
Architecture: NameNode (metadata) and DataNodes (data blocks)
Large volume: 200 PB of storage and a single cluster of
4500 servers, supporting close to a billion files and blocks;
Minimal data motion: Hadoop moves compute processes to the data on HDFS, not the other way around; moving computation is cheaper than moving data;
Dynamic diagnosis: diagnoses the health of the file system and rebalances the data across different nodes;
Rollback: Allows operators to bring back the previous
version of HDFS after an upgrade;
Node redundancy: Supports high availability (HA);
Storage: Hadoop Distributed File System (HDFS)
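For concreteness, a minimal sketch of working with HDFS from Python through the standard `hdfs dfs` command line; paths are hypothetical, and it assumes a configured Hadoop client on PATH:

```python
import subprocess

# Copy a local file into HDFS (hypothetical paths).
subprocess.run(["hdfs", "dfs", "-put", "sales.csv", "/data/raw/sales.csv"],
               check=True)

# Raise the file's replication factor to 3 and wait for it to apply;
# this is the node-redundancy mechanism described above.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", "/data/raw/sales.csv"],
               check=True)

# Report file-system health: block placement, under-replicated blocks.
subprocess.run(["hdfs", "fsck", "/data/raw", "-files", "-blocks"],
               check=True)
```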
6. Data Management
Manages cluster compute resources and job scheduling on top of HDFS
Multi-tenancy: multiple access engines, batch, interactive and real-time, can simultaneously access the same data set, making Hadoop their common standard.
Cluster utilization: dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions.
Scalability: data center processing power continues to expand rapidly; ResourceManager scheduling lets clusters expand to thousands of nodes managing petabytes of data.
Processing: Hadoop YARN
7. Data Access
Processes data
The Map function: the InputFormat divides input data into splits (parts), and a map task is created for each split in the input.
The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reducer.
The Reduce function: collects the various results and combines them to answer the larger problem that the master node needs to solve.
Each reducer collects the data from all of the maps for its keys and combines them to solve the problem.
Batch: MapReduce (unstructured data analysis)
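To make the map/reduce flow concrete, here is a minimal word-count sketch using Hadoop Streaming, which allows the two functions to be written in Python; file names and paths are hypothetical:

```python
#!/usr/bin/env python
# mapper.py -- emits one (word, 1) pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so per-word counts can be
# summed in a single pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current and current is not None:
        print("%s\t%d" % (current, count))  # flush previous word
        count = 0
    current = word
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))      # flush last word
```

The job would be launched with the streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar name and path vary by distribution).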
8. Data Integration & Governance
High-volume data ingestion
Stream data: ingests streaming data from multiple sources into Hadoop for storage and analysis.
Guaranteed data delivery: channel-based transactions guarantee reliable message delivery.
When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. This ensures guaranteed-delivery semantics.
Scales horizontally: to ingest new data streams and additional volume (see the configuration sketch below).
Real-time ingest: Apache Flume (diagram: unstructured data flows from agent nodes through collector nodes into the HDFS storage area).
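As a sketch of the source/channel/sink wiring described above, the following writes a minimal, hypothetical Flume agent configuration (a netcat source feeding an HDFS sink through a channel, which provides the transactional guarantees) and starts the agent; names, ports and paths are assumptions:

```python
import subprocess

# Hypothetical single-agent configuration.
FLUME_CONF = """
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /data/raw/events
agent1.sinks.sink1.channel = ch1
"""

with open("agent1.conf", "w") as f:
    f.write(FLUME_CONF)

# Start the agent (assumes flume-ng is on PATH; a --conf directory with
# flume-env.sh may also be supplied).
subprocess.run([
    "flume-ng", "agent",
    "--name", "agent1",
    "--conf-file", "agent1.conf",
], check=True)
```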
9. Data Integration & Governance
Very-large-volume data ingestion
Fast: benchmarked at processing a million-plus messages/records per second per node.
Scalable: parallel calculations run across a cluster of machines.
Fault-tolerant: when workers die, Storm automatically restarts them; if a node dies, the worker is restarted on another node.
Reliable: Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
Real-time ingest: Apache Storm
10. Data Integration & Governance
Connects Hadoop to traditional RDBMSs
Data imports: moves data from external stores and EDWs into Hadoop to optimize the cost-effectiveness of combined data storage and processing.
Improvements: compression and indexing for query performance.
Parallel data transfer: for faster performance and optimal system utilization.
Fast data copies: from external systems into Hadoop.
Load balancing: mitigates excessive storage and processing loads on other systems.
Efficient data analysis: improves the efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake (see the import sketch below).
Batch integration: Apache Sqoop
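A hedged sketch of a parallel Sqoop import from a relational database into HDFS; the connection string, credentials, table and paths are hypothetical, and it assumes the `sqoop` CLI plus a JDBC driver for the source database:

```python
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # source RDBMS (hypothetical)
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",         # keep the password off the command line
    "--table", "orders",                        # table to import
    "--target-dir", "/data/warehouse/orders",   # destination directory in HDFS
    "--num-mappers", "4",                       # parallel transfer, as noted above
    "--compress",                               # compressed storage in HDFS
], check=True)
```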
11. Data Access
Easy programming language
Easily programmed: complex tasks involving interrelated data transformations can be simplified and encoded as data-flow sequences. Pig programs accomplish huge tasks, yet they are easy to write and maintain.
Iterative data processing: extract-transform-load (ETL) data pipelines and research tools on raw data.
Extensible: Pig users can create custom functions to meet their particular processing requirements.
Self-optimizing: because the system automatically optimizes execution of Pig jobs, the user can focus on semantics.
Script/ETL: Apache Pig
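As an illustration of such a data-flow sequence, a small, hypothetical Pig script run from Python (input and output paths are assumptions; the `pig` CLI must be on PATH):

```python
import subprocess

# Count page hits per user from a tab-separated web log already in HDFS.
PIG_SCRIPT = """
logs    = LOAD '/data/raw/weblogs' USING PigStorage('\\t')
          AS (user:chararray, url:chararray);
grouped = GROUP logs BY user;
hits    = FOREACH grouped GENERATE group AS user, COUNT(logs) AS n;
STORE hits INTO '/data/out/user_hits';
"""

with open("user_hits.pig", "w") as f:
    f.write(PIG_SCRIPT)

# Run the script on the cluster.
subprocess.run(["pig", "-f", "user_hits.pig"], check=True)
```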
12. Data Access
SQL-like tool
Familiar: query data with a SQL-based language. Hive tables are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units.
Fast: interactive response times, even over huge datasets.
Partitioned: each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory (see the sketch below).
Scalable and extensible: as data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance.
Uses JobTracker (MapReduce) functionality for query execution.
SQL-like: Apache Hive
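A sketch of a partitioned Hive table queried from Python, assuming a running HiveServer2 and the PyHive client; host, table and columns are hypothetical:

```python
from pyhive import hive  # assumes the PyHive package is installed

conn = hive.Connection(host="hive-server", port=10000)  # hypothetical host
cur = conn.cursor()

# A partitioned table: each sale_date value becomes a sub-directory of
# the table directory, as described above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id BIGINT,
        store    STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Partition pruning: only the 2016-01-01 sub-directory is scanned.
cur.execute("SELECT store, SUM(amount) FROM sales "
            "WHERE sale_date = '2016-01-01' GROUP BY store")
for store, total in cur.fetchall():
    print(store, total)
```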
13. Data Access
NoSQL tool with a SQL-like command interface
Apache HBase is an open-source NoSQL database that provides real-time read/write access to large datasets.
Scales linearly to handle huge data sets with billions of rows and millions of columns.
Easily combines data sources that use a wide variety of different structures and schemas.
Natively integrated with Hadoop, and works seamlessly alongside other data access engines through YARN.
A natural choice for storing semi-structured data such as log data.
Online: Apache HBase
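A minimal read/write sketch against HBase from Python, assuming the HBase Thrift server is running and the happybase client is installed; the table, row key and columns are hypothetical:

```python
import happybase

conn = happybase.Connection("hbase-thrift-host")  # hypothetical host

# Rows are keyed byte strings; columns live in column families, so rows
# in the same table can carry different column sets (flexible schema).
table = conn.table("weblogs")
table.put(b"user42|2016-01-01T12:00:00", {
    b"req:url":    b"/checkout",
    b"req:status": b"200",
})

# Real-time read of a single row by key.
print(table.row(b"user42|2016-01-01T12:00:00"))
```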
14. Data Access
Brings fast, in-memory data processing to Hadoop.
Elegant and expressive development APIs in Scala, Java, R, and Python.
Allows data workers to efficiently execute streaming, machine-learning or SQL workloads that need fast iterative access to datasets.
Designed for data science; its abstractions make data science easier.
Data scientists commonly use machine learning, a set of techniques and algorithms that can learn from data; these algorithms are often iterative, which is where in-memory processing pays off (see the sketch below).
In-memory: Apache Spark
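A short PySpark sketch of the in-memory, iterative pattern; it assumes Spark 2.x, and the dataset path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Caching keeps the dataset in memory, which is what makes iterative
# (e.g. machine-learning) workloads fast.
events = spark.read.json("hdfs:///data/raw/events").cache()

# Every pass over `events` after the first is served from memory.
events.groupBy("user").count().show()
print(events.filter(events.status == 500).count())
```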
16. ERP/DW-BI Platform
Fast in-memory database
Traditional DBMS features: SQL interface, transactional isolation and recovery (ACID).
Parallel data-flow model: calculations can be executed in parallel and distributed across hosts.
Latest-generation data storage:
Columnar and row-based tables;
Near-elimination of indexes;
High data compression.
Automatic recovery: from memory errors, without a system reboot.
Native tools: Predictive Analysis Library, plus analytical and special-purpose interfaces.
SAP/HANA architecture
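A sketch of the column-store basics from Python, assuming SAP's hdbcli client is installed; the connection details, credentials and table are hypothetical:

```python
from hdbcli import dbapi

# Hypothetical host, instance port and credentials.
conn = dbapi.connect(address="hana-host", port=30015,
                     user="DEMO", password="ChangeMe1")
cur = conn.cursor()

# Columnar storage is the default choice for analytics: high compression
# and near-elimination of indexes follow from the column-store layout
# described above.
cur.execute("""
    CREATE COLUMN TABLE sales (
        order_id BIGINT,
        store    NVARCHAR(40),
        amount   DECIMAL(15, 2)
    )
""")
cur.execute("SELECT store, SUM(amount) FROM sales GROUP BY store")
print(cur.fetchall())
```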
17. ERP/DW-BI Platform
100x faster
Optimization: InfoCubes and DataStore Objects (DSO) with better performance.
Faster remodeling: improved and lean data models; simplified data modeling and reduced materialized layers.
Data marts: integrated and embedded flexibility; OLAP and OLTP can also be executed in one system.
Increased flexibility: optimized Layered Scalable Architecture; aggregates and cubes are no longer required (they become optional).
Improved response times: for existing transactions and entire business processes, through the general performance improvement of the underlying HANA database.
SAP/Hana Technology Advantages & Profile
Size – Big: not considered big for the Web 2.0 era; tens of terabytes, not reaching petabytes.
Structure – Structured: separated into columns/rows and with a schema.
Source – Internal: in almost all cases from inside the corporation (ERP/CRM/SCM).
18. SAP/Hana Evolution
Starting point: the SAP landscape consists of SAP ERP running on a relational database, connected to an OLAP engine (e.g. SAP BI) and perhaps using a Business Intelligence Accelerator like BOBJ.
(Diagram, ERP/DW-BI platform: OLTP on SAP/ECC → ETL → OLAP on SAP/BW → analytics in SAP/BOBJ.)
Introducing HANA in parallel: install and run the in-memory engine (HANA) TOGETHER with the traditional SAP instances.
Two BW extractors run at the same time, exporting the same data.
Key factor: real data-processing performance COMPARISON.
(Diagram: the original OLTP SAP/ECC → ETL → OLAP SAP/BW → SAP/BOBJ analytics chain, plus a second ETL feeding SAP/HANA → SAP/BOBJ analytics in parallel.)
19. SAP/Hana Evolution
BW database upgrade: re-creates traditional-style BI in memory.
(Diagram, ERP/DW-BI platform: OLTP SAP/ECC → ETL → OLAP SAP/BW running on SAP/HANA → analytics in SAP/BOBJ.)
ERP/BI full database upgrade: eliminates the traditional database and runs both instances in-memory, using non-materialized views.
(Diagram: OLTP SAP/ECC and OLAP BI 2.0 both running on SAP/HANA, with analytics in SAP/BOBJ.)
20. ERP/DW-BI Platform
Sizing on SAP/Hana
• Memory
• Traditional sizing is CPU-driven; SAP HANA sizing is memory-driven (CPU performance <> SAP HANA memory).
• Master/transactional data volume determines main memory.
• Main memory is required for:
• Storing the business data;
• Temporary memory space;
• Supporting complex queries;
• Buffers & caches.
• CPU
• Behaves differently with SAP HANA compared to traditional databases.
• Queries: complex, and at maximum speed.
• Disk size
• Disk storage space is still required:
• To preserve database information if the system shuts down (either intentionally or due to a power loss);
• Data changes are periodically copied to disk (ensuring a full image of the business data on disk);
• A logging mechanism enables system recovery.
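A rough, illustrative calculation of the memory points above, assuming a column-store compression factor and the common rule of thumb that about half of main memory is reserved for temporary/query work space; these factors are assumptions, not official SAP sizing figures:

```python
# Illustrative HANA memory sizing sketch (not SAP's official formula).
source_data_tb = 10.0   # uncompressed business data (assumption)
compression    = 5.0    # column-store compression factor (assumption)

compressed_tb = source_data_tb / compression
ram_tb = compressed_tb * 2   # data + work space, buffers and caches

print("Compressed data: %.1f TB" % compressed_tb)  # 2.0 TB
print("Main memory:     %.1f TB" % ram_tb)         # 4.0 TB
```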
21. ERP/DW-BI Platform
SAP/Hana on VM
• SAP HANA on vSphere is fully supported.
• Combining SAP HANA and vSphere provides additional benefits with regard to deployment and availability.
• Some customer slots remain for SAP-on-SAP-HANA controlled-availability proofs of concept.
• SAP HANA monitoring via the Blue Medora plug-in:
• Monitor memory and vCPU utilization;
• Add/delete resources;
• Underutilized: deploy more SAP HANA;
• Overutilized: release SAP HANA resources;
• Workload management;
• Determine consolidation ratios.
22. • Amazon AWS: SAP partner
• SAP BW on HANA trial - PoC
• The AWS server provides a HANA instance, ready to go in 30 minutes
• OLAP: BW on HANA, or any other data warehouse application with predominantly OLAP workloads, including data marts running a lot of complex queries
• OLTP: any transactional application, like Business Suite on HANA, predominantly running simple queries or CRUD operations
ERP/DW-BI Platform
SAP/Hana on Cloud
23. • Storage replication:
• The storage itself replicates all data to another location, within one or between several data centers. The technology is hardware-vendor-specific, and multiple concepts are available on the market.
• System replication:
• SAP HANA replicates all data to another location, within one or between several data centers. The technology is independent of hardware-vendor concepts and reusable with a changing infrastructure.
ERP/DW-BI Platform
Disaster Recovery
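As a heavily hedged sketch of setting up HANA system replication: the `hdbnsutil` tool enables the primary site and registers a secondary. Flag names vary between HANA revisions, and the site names, host and instance number here are hypothetical:

```python
import subprocess

# On the primary host: enable system replication for this site.
subprocess.run(["hdbnsutil", "-sr_enable", "--name=SiteA"], check=True)

# On the secondary host (with HANA stopped): register against the
# primary in synchronous replication mode.
subprocess.run([
    "hdbnsutil", "-sr_register",
    "--name=SiteB",
    "--remoteHost=primary-host",
    "--remoteInstance=00",
    "--replicationMode=sync",
], check=True)
```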
24. Host Auto-Failover
• Standby mode:
• No data, requests or queries.
• When an active (worker) host fails, a standby host automatically takes its place.
• Since the standby host can take over operations from any of the primary hosts, it needs access to all of the database volumes.
• Once repaired:
• The failed host can be rejoined to the system as the new standby host, re-establishing the failure-recovery capability.
ERP/DW-BI Platform
SAP/HANA High Availability
26. ERP/DW-BI & Big Data Platform
Architecture proposal (diagram): unstructured data flows into the Hortonworks Data Platform: real-time ingest via Flume and Storm, batch integration via Sqoop; data management with YARN (processing) on HDFS (storage); data access via Pig (script/ETL), MapReduce (process), Hive (SQL-like), HBase (online) and Spark (in-memory). Structured data from ERP/CRM/SCM lands in HANA data repositories, whose OLAP, predictive and spatial engines feed the application logic, rendering and analytics.
27. ERP/DW-BI & Big Data Platform
Business Case: CRM/Retail
Internal structured data sources
• Point-of-sale data – data captured when the customer makes purchases, either in-store or on the company's e-commerce site (4 TB)
• Inventory and stock information – which products are in stock at which locations/promotions (7 TB)
• CRM data – from all the interactions the customer has had with the company at the support site (8 TB)
• Total data size: 19 TB
External unstructured data sources
• Social media data – sentiment analysis of the customer's social media, such as Facebook (70 TB)
• Historical Web log information – a record of the customer's past browsing behavior on the company's Web site (30 TB)
• Geographic customer behavior – origin/destination of potential customers near stores (20 TB)
• Total data size: 120 TB
28. ERP/DW-BI & Big Data Platform
Business Case: Data Process