Integration of SAP HANA with Hadoop

Author: Ramkumar Rajendran
Author Biography

Ramkumar Rajendran is a consultant at a leading firm with four years of experience. He specializes in tools such as SAP HANA, SAP BI, SAP BO (Xcelsius, Web Intelligence and IDT), Tableau, Lumira and Hadoop Hive. He has worked on sentiment analysis of Twitter data, has been involved in the integration of HANA and Hadoop, and has worked on multiple implementation projects across various industry sectors.
Table of Contents

1 About this document
2 Introduction
  - SAP HANA
  - Hadoop
3 Combined Potential of HANA and Hadoop
4 Scenarios of Hadoop and HANA integration
  - Federated Data Query through Smart Data Access (SDA)
  - Business Objects Data Services
  - SQOOP
  - JAVA Program
5 Summary
6 Reference Material
About this document

This document discusses the combined potential of the in-memory database SAP HANA and the big data solution Hadoop, the various methods of integrating the two technologies, and the scenarios in which each method is applicable. SAP HANA specializes in real-time in-memory processing, while Hadoop is suited to massive parallel processing; integrating the two brings the advantages of both. Hadoop handles both structured and unstructured data from social media, machine logs, etc., which can be used alongside the transactional data present in HANA, resulting in more mature business analysis. This document has been prepared based upon SAP HANA SP6 and Hadoop CDH 4.5.
Introduction

SAP HANA

SAP HANA is an innovative in-memory database and data management platform, specifically developed to take full advantage of the capabilities provided by modern hardware to increase application performance. By keeping all relevant data in main memory, data processing operations are significantly accelerated. Design for scalability is a core SAP HANA principle: SAP HANA can be distributed across multiple hosts to achieve scalability in terms of both data volume and user concurrency. Unlike conventional clusters, distributed HANA systems also distribute the data efficiently, achieving high scaling without I/O locks. The key performance indicators of SAP HANA appeal to many customers, and thousands of deployments are in progress; SAP HANA has become the fastest growing product in SAP's 40+ year history.

Hadoop

Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop is known for its massive parallel processing capabilities on large datasets. It is also scalable, flexible, fault tolerant, and cost-effective owing to its use of commodity hardware.
Combined Potential of HANA and Hadoop

Hadoop can store huge amounts of data. It is well suited to storing unstructured data, is good at manipulating very large files, and is tolerant of hardware and software failures. The main challenge with Hadoop, however, is getting information out of this data in real time. HANA, thanks to its in-memory technology, is well suited to processing data in real time. By integrating Hadoop's massive parallel processing with HANA's in-memory computing capabilities, the resulting solution is capable of the following:

- Accommodating both structured and unstructured data.
- Providing cost-efficient data storage and processing for large volumes of data.
- Computing complex information processing.
- Enabling heavily recursive algorithms, machine learning, and queries that cannot easily be expressed in SQL.
- Archiving low-value data that stays available, though access is slower.
- Mining raw data that is either schema-less or whose schema changes over time.
Scenarios of Hadoop and HANA integration

[Overview diagram: four integration techniques, each ultimately feeding reporting tools on top of SAP HANA.]

- Federated Data Query through Smart Data Access (SDA): Hadoop → SDA → SAP HANA → reporting tools; no data loading.
- Business Objects Data Services (BODS): Hadoop → BODS → SAP HANA → reporting tools; data loading via a PULL mechanism.
- SQOOP: Hadoop → SQOOP → SAP HANA → reporting tools; data loading via a PUSH mechanism.
- JAVA program: Hadoop → Java → SAP HANA → reporting tools; data loading via a PUSH or PULL mechanism.
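Of these four techniques, SDA is driven entirely from the HANA side in SQL. As a rough sketch of what the setup looks like (the remote source name, ODBC DSN, credentials, and the Hive and HANA table names below are illustrative assumptions, and exact syntax can vary by HANA revision):

```sql
-- Register the Hadoop/Hive system as a remote source in HANA
-- (assumes an ODBC DSN named "HIVE1" has been configured on the HANA host)
CREATE REMOTE SOURCE "HADOOP_HIVE" ADAPTER "hiveodbc"
  CONFIGURATION 'DSN=HIVE1'
  WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=hive';

-- Expose a Hive table as a virtual table in a HANA schema
CREATE VIRTUAL TABLE "MYSCHEMA"."VT_WEBLOGS"
  AT "HADOOP_HIVE"."HIVE"."default"."weblogs";

-- A federated query: the aggregation over the virtual table can be
-- pushed down to Hadoop, and only the results travel to HANA
SELECT c.region, SUM(v.hits) AS total_hits
FROM "MYSCHEMA"."VT_WEBLOGS" v
JOIN "MYSCHEMA"."CUSTOMERS" c ON v.user_id = c.user_id
GROUP BY c.region;
```

The virtual table behaves like a local HANA table in queries, but every access triggers work on the Hadoop side, which is the root of both the real-time advantage and the availability risk discussed below.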
Federated Data Query through Smart Data Access (SDA)

SAP HANA smart data access enables remote Hadoop data to be accessed as if it were held in local tables in SAP HANA, without loading the data into SAP HANA. Not only does this capability provide operational and cost benefits, but most importantly it supports the development and deployment of the next generation of analytical applications, which require the ability to access, synthesize and integrate data from multiple systems in real time, regardless of where the data is located or what systems generate it.

Specifically, in SAP HANA we can create virtual tables which point to remote tables in Hadoop. Customers can then write SQL queries in SAP HANA which operate on these virtual tables. The SAP HANA query processor optimizes the queries, executes the relevant part of each query in the target database, returns the results to SAP HANA, and completes the operation.

Recommended Scenarios

Using SDA to access Hadoop from HANA means a federated query is fired on Hadoop each time a report is executed. This technique is recommended when a large result set is generated at the Hadoop end when the reporting query is fired: Smart Data Access aggregates the dataset on Hadoop using Hadoop's system resources, so that only the end results are transferred from Hadoop to HANA.

Advantages of this technique

- Real-time data access from Hadoop without actually having to load the data into HANA.
- Helps in scenarios where the data residing in Hadoop is updated very frequently, so that data loading would make no sense.
- Queries can be optimized by pushing processing down to Hadoop, which then returns aggregated data.

Disadvantages of this technique

- The federated query slows down when heavy processing needs to be done on the data at the Hadoop end.
- Data transformation is not possible while using Smart Data Access.
- With this technique the reporting query is also fired on Hadoop, which makes it critical for Hadoop to be up at all times. With multiple Hadoop systems the risk is correspondingly greater.
- Data can only be extracted from Hive.
- Data access can happen only from Hadoop to HANA.

Business Objects Data Services

SAP Data Services delivers a single enterprise-class solution for data integration, data quality, data profiling and text data processing. This technique uses a data PULL mechanism from Hadoop to HANA, so the entire control lies with BODS. Its wide range of features helps to:

- Integrate, transform, improve and deliver trusted data from Hadoop to HANA.
- Provide development user interfaces, a metadata repository, a data connectivity layer, a run-time environment and a management console, enabling IT organizations to lower total cost of ownership and accelerate time to value.
- Enable IT organizations to maximize operational efficiency with a single solution to improve data quality and gain access to heterogeneous sources and applications.

Recommended Scenarios

Integrating HANA with Hadoop using BODS involves loading data on a schedule. This can be utilized in scenarios where there is no requirement for real-time reporting but complex calculations on large datasets are involved. The technique proves very effective in scenarios involving multiple Hadoop systems with a variety of unstructured data to be processed on a large scale.
Advantages of this technique

- Unstructured data can be loaded from Hadoop to HANA, with all transformation done during loading.
- It is better suited to loading large datasets.
- BODS can be utilized to implement complex transformations while loading data from Hadoop to HANA.
- HANA performance can be improved by moving complex calculations to BODS.
- Its error-handling capabilities aid support and maintenance.
- Data encryption functions for sensitive data are one of the niche aspects of loading through BODS.
- Centralized monitoring favors better IT support.
- Delta loads are also supported.
- Data transfer can happen in both directions.

Disadvantages of this technique

- Data present in Hadoop cannot be accessed in real time, since BODS loads data from Hadoop to HANA as a batch job.

SQOOP

SQOOP is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as Oracle, MS SQL Server, SAP HANA, etc. SQOOP can be used to import data from external structured data stores into the Hadoop Distributed File System or related systems such as Hive and HBase. Conversely, SQOOP can extract data from Hadoop and export it to external structured data stores such as relational databases and enterprise data warehouses.

SQOOP provides a pluggable connector mechanism for optimal connectivity to external systems. The SQOOP extension API provides a convenient framework for building new connectors, which can be dropped into SQOOP installations to provide connectivity to various systems. SQOOP itself comes bundled with connectors for popular database and data warehousing systems.
With SQOOP, data transfer is automated through batch jobs, and native tools are used for high-performance transfer. SQOOP uses data store metadata to infer structure definitions, and it utilizes Hadoop's MapReduce framework to transfer data in parallel, which pays off for huge amounts of data. It provides an extension mechanism to incorporate high-performance connectors for external systems. For exporting data to external targets, SQOOP supports staging tables, which considerably improve the efficiency of data transfer and also insulate the target from data corruption in times of failure. This technique uses a PUSH mechanism to load data from Hadoop to HANA, so the entire control lies with SQOOP on the Hadoop side.

Recommended Scenarios

SQOOP is a component in the Hadoop ecosystem which helps in data transfer from HDFS to external databases and vice versa. This technique of integrating SAP HANA with Hadoop involves periodically loading data directly from the underlying Hadoop files into HANA tables. SQOOP does not support any transformation while transferring data; hence this technique can be used in scenarios that require no real-time reporting and have readily formatted source data needing no cleansing. It is also best suited to bulk data transfers, since SQOOP's use of Hadoop's underlying MapReduce framework enables parallel data transfer.

Advantages of this technique

- It is better suited to loading bulk datasets.
- Data transfers can happen in both directions.
- It is open source and hence cost-effective.

Disadvantages of this technique

- Data present in Hadoop cannot be accessed in real time, since SQOOP loads data from Hadoop to HANA as a batch job.
- No cleansing or formatting of the data can be done with SQOOP.
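A SQOOP push to HANA is a single command-line invocation scheduled as a batch job. The sketch below shows the general shape of such an export (the host, port, schema, table names, HDFS path and parallelism are illustrative assumptions; it requires a live Hadoop cluster with SQOOP installed and the SAP HANA JDBC driver, ngdbc.jar, on SQOOP's classpath):

```shell
# Export a Hive-managed HDFS directory into a HANA table.
# --staging-table insulates the target from partial loads on failure.
sqoop export \
  --connect "jdbc:sap://hana-host:30015/?currentschema=STAGING" \
  --driver com.sap.db.jdbc.Driver \
  --username LOADER \
  --password-file /user/loader/hana.pwd \
  --table WEBLOGS \
  --staging-table WEBLOGS_STG \
  --export-dir /user/hive/warehouse/weblogs \
  --input-fields-terminated-by '\001' \
  -m 4
```

The `-m 4` flag asks for four parallel map tasks, which is where the MapReduce-based parallelism described above comes in; the right degree of parallelism depends on cluster capacity and how many concurrent inserts the HANA side should absorb.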
JAVA Program

A Java program can be used to load data from Hadoop to HANA through JDBC connectivity. This HANA-Hadoop technique offers a very high level of customization in terms of cleansing, transformation, refining, filtering, etc. Both PUSH and PULL mechanisms can be implemented, depending on where the program is installed and scheduled.

Recommended Scenarios

This technique is recommended in scenarios involving relatively small data transfers. It gives developers a very high level of control, so they can build a highly customized solution.

Advantages of this technique

- It offers customization to a greater extent.
- Java is open source, and hence it is a cost-effective solution.
- A Java program can be executed from the command line and does not require any additional hosting setup.

Disadvantages of this technique

- It requires a high level of programming skill.
- Error tracking and debugging become difficult.
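The JDBC technique described above can be sketched as a plain Java program that reads from Hive over JDBC and batch-inserts into HANA. This is a minimal illustration, not a production loader: hosts, credentials, and the `weblogs`/`STAGING.WEBLOGS` table names are assumptions, and running it requires the Hive JDBC driver (hive-jdbc) and the HANA JDBC driver (ngdbc.jar) on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveToHanaLoader {
    public static void main(String[] args) throws Exception {
        // Source (HiveServer2) and target (HANA) connections -- hosts/ports are assumptions
        try (Connection hive = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-host:10000/default", "hive", "");
             Connection hana = DriverManager.getConnection(
                 "jdbc:sap://hana-host:30015/", "LOADER", "secret")) {

            hana.setAutoCommit(false); // commit in chunks, not per row

            try (Statement src = hive.createStatement();
                 ResultSet rs = src.executeQuery(
                     "SELECT user_id, page, hits FROM weblogs");
                 PreparedStatement tgt = hana.prepareStatement(
                     "INSERT INTO STAGING.WEBLOGS (USER_ID, PAGE, HITS) VALUES (?, ?, ?)")) {

                int rows = 0;
                while (rs.next()) {
                    // Any custom cleansing/filtering logic would go here
                    tgt.setString(1, rs.getString(1));
                    tgt.setString(2, rs.getString(2));
                    tgt.setLong(3, rs.getLong(3));
                    tgt.addBatch();
                    if (++rows % 10_000 == 0) { // flush every 10k rows
                        tgt.executeBatch();
                        hana.commit();
                    }
                }
                tgt.executeBatch(); // flush the final partial batch
                hana.commit();
            }
        }
    }
}
```

The row-by-row loop is exactly why this approach suits small transfers and arbitrary transformations but cannot compete with SQOOP's parallel MapReduce transfer for bulk loads.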
Summary

The integration of HANA with Hadoop enables customers to move data between Hive, Hadoop's Distributed File System and SAP HANA. Hadoop is good at processing bulk data at a very low cost; hence, if a particular chunk of data is not very valuable to users and is accessed rarely, storing it in HANA will be cost-prohibitive. By combining SAP HANA and Hadoop, customers get the power of instant access with SAP HANA and near-infinite scale with Hadoop. This gives SAP users a broad range of options for storing and analyzing new types of data, and the ability to create applications that can uncover new business opportunities from vast amounts of data, which would not previously have been possible.

References

- http://blog.cloudera.com/blog/
- https://www.brighttalk.com/webcast/9727/86361
- http://scn.sap.com/community/developer-center/hana/blog/2014/01/27/exporting-and-importing-data-to-hana-with-hadoop-sqoop
- http://www.saphana.com/docs/DOC-2934
