Hana Intel SAP Whitepaper
Analyzing Business as It Happens

SAP In-Memory Appliance Software (SAP HANA™) runs on the Intel® Xeon® processor to generate superior, real-time business intelligence.

WHITE PAPER
Intel® Xeon® Processor E7 Family
SAP In-Memory Appliance Software (SAP HANA™)
April 2011 | Version 1.0

1 Introduction

Increasingly sophisticated business decision models depend on extremely fast access to and manipulation of massive data stores. Insight into business operations often demands data volumes that are beyond the capabilities of traditional disk-based systems to process in real time, which can limit access to benefits such as the following:

• Efficiency provided by the ability to respond in real time to the changing needs of the business
• Flexibility based on insight that accurately directs quick action
• Empowerment of business users to make and act on smart decisions

This reality often causes a separation between ongoing business needs and the analytic applications that support them. In such cases, the lag time between data gathering and interpretation delays the ability to benefit from business information, thereby limiting its value.

SAP In-Memory Appliance Software (SAP HANA™) delivers next-generation in-memory computing through an ongoing engineering collaboration between SAP and Intel to provide optimized performance and reliability on Intel® architecture, including the Intel® Xeon® processor E7 family.

This paper introduces SAP In-Memory Computing technology concepts, explains how SAP HANA uses them to help deliver real-time business insight, and describes how synergies with Intel architecture-based server platforms drive solution value. The discussion begins with an examination of the hardware and software innovations that set the stage for SAP In-Memory Computing, including the ways in which the architectures of Intel Xeon processors and SAP HANA complement one another. Next, the paper describes specific optimizations made as part of the ongoing, collaborative engineering relationship between Intel and SAP. In the context of general database concepts, the paper illustrates how SAP HANA takes advantage of Intel architecture to deliver high performance and reliability, as well as excellent support for application developers working with SAP HANA. Next, the paper describes real-world customer benefits derived from SAP HANA implementations. Finally, the paper explores how future technology may drive even further the value of joint solutions based on Intel Xeon processors and SAP HANA.

SAP In-Memory Appliance Software (SAP HANA™) uses SAP In-Memory Computing technology and the performance and reliability of the Intel® Xeon® processor E7 family to dramatically improve the speed of data analysis, enabling business intelligence to be generated in real time. That capability increases the value of business data with more sophisticated decision support than has been possible with solutions based on traditional disk-based systems.
Contents

1 Introduction
2 Intel and SAP Technology Advances Drive Database Evolution
   2.1 Intel Hardware Technology Innovations
      2.1.1 The Rise of Mainstream Server Parallelism
      2.1.2 The Drive to 64-Bit Processing
      2.1.3 Other Architecture Features that Benefit SAP HANA
   2.2 SAP Software Technology Innovations
      2.2.1 Establishing a Solid Database for In-Memory Computing
      2.2.2 Delivering the SAP In-Memory Computing Database into the Mainstream
      2.2.3 SAP HANA: The Next Generation of Real-Time Business Analysis
3 Creating a Transition in Database Design
   3.1 Performance Implications of Column versus Row Storage
   3.2 Dictionary Compression with Column Storage
   3.3 Efficiencies from Separation of Main and Delta Storage
   3.4 Storage of and Queries against Historical Data
   3.5 Boosting Performance by Moving Results Instead of Source Data
4 Collaboration by SAP and Intel for Performance Optimization and Parallelization
   4.1 Benefits from Multi-Socket Topology and NUMA
   4.2 Optimization for Multi-Core Processing
      4.2.1 Parallel Aggregation
      4.2.2 Processor Cache Optimizations
   4.3 Optimization for Core-Level Features
      4.3.1 Optimization for Intel® HT Technology
      4.3.2 Standards Compliance and Accuracy for Decimal Floating-Point Math
      4.3.3 Benefits from Intel® Turbo Boost Technology
      4.3.4 Optimization of Bit Compression with Intel® SSE-Based Vectorization
   4.4 Performance Benefits from Optimization and Hardware Platform Features
      4.4.1 Performance Comparison between Processor Generations and Impact of Hardware Features
      4.4.2 Performance Benefits of Intel® SSE Vectorization for Search
      4.4.3 Performance Scaling with Increasing Core Count
5 Enhancing Reliability for In-Memory Computing Systems
   5.1 In-Memory Recovery
   5.2 ACID Properties
6 Application Development and Tools for In-Memory Computing
   6.1 Intel® Performance Counter Monitor for Debugging and Optimization
   6.2 Intel® Software Development Products
7 Real Customer Example: Building Richer Sales Support at Hilti Corporation
8 Outlook and Future Research Topics
   8.1 Evolution of Non-Volatile Memory
   8.2 Hardware Acceleration
   8.3 New Instructions
   8.4 Many Cores
9 Conclusion
10 Further Reading

2 Intel and SAP Technology Advances Drive Database Evolution

The evolution of Intel architecture toward greater performance at lower cost sets the stage for the advances of SAP HANA. The SAP In-Memory Computing Database that lies at the heart of this technology is engineered to take advantage of each aspect of a high-performance, balanced compute platform:

• Memory. Increasingly large amounts of addressable memory, running at ever-higher speeds, combine with growing caches and other improvements to create memory subsystems robust enough to easily support the demands of SAP In-Memory Computing.
• Processing. Constant architectural improvements to complete more work per clock cycle extend the value of growing parallelism from multi-core designs, Intel® Hyper-Threading Technology (Intel® HT Technology), and increasingly scalable symmetric multi-processing.
• I/O. Very fast on-board data transfer using Intel® QuickPath Interconnect is complemented by advances in inter-node communication, to enable low latency in data transfer within and between nodes in the SAP HANA appliance.

2.1 Intel Hardware Technology Innovations

Intel has continually led the industry in driving advances in commercial off-the-shelf server platforms that increase performance, scalability, and value. These innovations directly enable SAP HANA to transcend barriers to what businesses are able to do with their data, generating better business decisions in response to real-time events.

This section traces the history of Intel processor innovation, setting the stage for an understanding of how SAP HANA appliances based on Intel architecture produce synergies between the hardware and software that benefit business customers.

"Intel and SAP engineering teams have collaborated extensively on SAP HANA™ and SAP In-Memory Computing technology. Comparing the in-memory system performance with a classical relational database, the query time was reduced from 77 minutes to 13 seconds when run on the Intel® Xeon® processor 7500 series. We are thrilled with this result, and it showcases once again the powerful impact of joint problem solving between our two companies."
– K. Skaugen, Intel Vice President and General Manager of the Data Center Group
2.1.1 The Rise of Mainstream Server Parallelism

At one time, boosts in processor performance were created largely by ongoing increases in clock speed. The simple fact of providing more cycles per second retired more instructions per unit of time, allowing software to become more capable and deliver faster response times. As clock speeds grew, however, processor designs faced limitations due to growing energy consumption and heat generation.

One key solution has been to produce multi-core processors, so that each processor socket contains multiple, individual execution engines—a design concept taken from mainframe and other large-scale systems, where it had been employed for some time. Even before introducing its multi-core designs, Intel developed Intel HT Technology, which allows a single processing core to handle two strings of instructions (software threads) at a time, making better use of on-die resources that would otherwise be idle. Together with symmetric multi-processing (multiple processors and multiple servers working together), multi-core processing and Intel HT Technology have made parallel programming the de facto industry standard. Some of the techniques used by SAP HANA to take advantage of parallel hardware resources are discussed later in this paper.

Intel and SAP: A History of Co-Innovation

For more than 10 years, Intel and SAP have worked together to deliver industry-leading performance of SAP solutions on Intel® architecture, and a large proportion of new SAP implementations are now deployed on Intel® platforms. The latest success from that tradition of co-innovation is available to customers of all sizes in SAP In-Memory Appliance Software (SAP HANA™), which is delivered on the Intel® Xeon® processor.

The relationship between Intel and SAP has become even stronger over the years, growing to include a broad set of collaborations and initiatives. Some of the most visible include the following:

• Joint roadmap enablement. Early in the design process, Intel and SAP decision makers identify complementary features and capabilities in their upcoming products, and those insights help to direct the development cycle for maximum value.
• Collaborative product optimization. Intel engineers located on-site at SAP work with their SAP counterparts to provide tuning expertise that enables SAP HANA and other software solutions to take advantage of the latest hardware features.
• Combined research efforts. Together, researchers from Intel and SAP continually explore and drive the future of business computing. The Belfast Collaboratory, an excellent example, opened in October 2009 to innovate around cloud computing and the next generation of sustainable IT.

As a result of these efforts, customer solutions achieve performance, scalability, reliability, and energy efficiency that translate into favorable ROI and TCO, for increased business value.

2.1.2 The Drive to 64-Bit Processing

Another limitation that enterprise computing systems have faced is that the 32-bit systems in general enterprise use until the mid-2000s were limited by their ability to address no more than approximately four gigabytes (GB) of system memory per server. Beginning in 2004 with the Intel Xeon processor formerly code-named Nocona, Intel began shipping hardware capable of running 64-bit OSs, which can address far larger amounts of memory. The theoretical limits of memory capacity are as follows:

• 32-bit systems: 2^32 = 4,294,967,296 bytes, or ~4 GB
• 64-bit systems: 2^64 = 18,446,744,073,709,551,616 bytes, or ~18 billion GB (18 exabytes)

In reality, the upper limits of memory addressability for 64-bit systems are governed by the physical capabilities of the platform and to some extent by the cost of the memory itself. As platforms become more capable, system memory capacity continues to grow, and at the same time, the cost of memory on a per-gigabyte basis continues to drop. Server memory capacities are now often measured in terabytes (TB), where 1 TB = 1,000 GB.

In addition to being able to address more memory than their 32-bit ancestors, platforms based on 64-bit architectures make more and wider registers available to the compiler (theoretically enabling the compiler to generate faster code). They also have the benefit of some larger data types, a crucial factor for SAP HANA, where gigabytes of memory are handled within a single process. In fact, the port to 64-bit architecture was one of the first joint efforts by Intel and SAP related to SAP NetWeaver® Business Warehouse Accelerator (BWA, formerly known as SAP NetWeaver Business Intelligence Accelerator), a precursor in many ways to SAP HANA.
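The address-space arithmetic quoted above can be checked in a few lines of Python. This is only an illustration of the theoretical limits in the text, not SAP HANA code:

```python
def addressable_bytes(pointer_bits: int) -> int:
    """Theoretical address-space limit for a pointer of the given width."""
    return 2 ** pointer_bits

GIB = 2 ** 30  # bytes per (binary) gigabyte

print(addressable_bytes(32))          # 4294967296 bytes
print(addressable_bytes(32) // GIB)   # 4 GB, the 32-bit ceiling
print(addressable_bytes(64))          # 18446744073709551616 bytes
print(addressable_bytes(64) / 10**9)  # ~1.8e10, i.e. the "~18 billion GB" above
```

The gap of more than nine orders of magnitude is what makes terabyte-scale in-memory databases possible only on 64-bit platforms.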
2.1.3 Other Architecture Features that Benefit SAP HANA

Successive generations of Intel® server platforms introduce an enormous range of other architectural features that drive new platform capabilities. For example, the Intel Xeon processor E7 family adds the following new features and capabilities relative to its predecessor that enhance the scalable performance of SAP HANA implementations:

• Increased core count, from 8 cores (16 threads) to 10 cores (20 threads) per socket, provides more computation resources for SAP HANA's data manipulation demands.
• Larger memory capacity, from 16 GB DIMMs (1 TB for a four-way server) to 32 GB DIMMs (2 TB for a four-way server), increases the ability of SAP HANA to hold large amounts of data for in-memory computation.
• Larger cache, from 24 megabytes (MB) to 30 MB, increases the amount of data SAP HANA can hold for fast access without reading from DRAM.

Likewise, evolving server platforms provide new features for security and for reliability, availability, and serviceability (RAS) that support business-critical SAP HANA deployments. Again using features and capabilities introduced with the Intel Xeon processor E7 family as an example, SAP HANA benefits from Double Device Data Correction (DDDC), which allows recovery from two simultaneous memory errors, supporting greater availability of SAP HANA systems.

Platform innovation also includes the introduction of new instructions such as the series of Intel® Streaming SIMD Extensions instruction sets (Intel® SSE, Intel® SSE2, Intel® SSE3, Intel® SSSE3, Intel® SSE4.1, and Intel® SSE4.2). These instructions are designed to accelerate specific types of operations, and their impact on SAP HANA is discussed in further detail later in this paper.

Finally, Intel continues to innovate around process-technology shrink in its manufacturing operations. Moving to a smaller process technology allows the creation of smaller transistors and other on-chip features, which allows more of those features to be placed in a given amount of on-die area, for faster switching, lower per-transistor energy consumption, and lower per-transistor waste-heat production. Figure 1 shows some key advances in the development of the Intel Xeon processor E7 family.

Figure 1. The Intel® Xeon® processor E7 family adds features and capabilities that directly benefit SAP HANA deployments. [Diagram comparing the 2010 Intel® Xeon® processor 7500 series with the 2011 Intel® Xeon® processor E7 family: 8 versus 10 cores; 24 MB versus 30 MB shared last-level cache; support for 16 GB versus 32 GB DDR3 DIMMs (1 TB versus 2 TB RAM for a four-socket system); 45nm versus 32nm process technology; each with integrated memory controllers and four Intel® QuickPath Interconnects at up to 6.4 GT/s. Both generations include Intel® Turbo Boost Technology (enhanced in the E7 family), Intel® Intelligent Power Technology, Intel® Hyper-Threading Technology, Machine Check Architecture Recovery, Intel® Scalable Memory Interconnect Failover, and Intel® QuickPath Interconnect Self-Healing; the figure also lists Intel® Trusted Execution Technology, Intel® AES New Instructions, Double Device Data Correction (DDDC), and Partial Memory Mirroring.]
2.2 SAP Software Technology Innovations

Around 2000, SAP recognized the trends toward rapidly growing data sets owned by many companies and increasing business needs for real-time analytics. In response to those realities, SAP engineers began to re-imagine how data warehouses could handle these massive data stores to generate immediate, ad hoc analysis. Disk-access times had become a primary bottleneck in addressing those needs using conventional database queries. Research showed promise for removing that factor by conducting queries solely in memory, without having to access the physical disk.

Two key factors helped make in-memory computing viable for mainstream implementations. As discussed above, by the middle of the first decade of the 21st century, system memory had begun to fall sharply in cost, and 64-bit servers had opened the door to commercial off-the-shelf hardware that could address very large amounts of system memory. Those advances set the stage for SAP's implementation of in-memory computing for general business use.

"For over 10 years, SAP and Intel have collaborated to create visionary solutions that enable customers to gain competitive advantage and run better. The SAP In-Memory Appliance (SAP HANA™), delivered on the Intel® Xeon® processor, provides customers with real-time results from analyses and transactions that enable better decisions and improved operations. It's just the latest example of how game-changing innovation between industry leaders can alter how enterprises do business and succeed."
– Stefan Sigg, Senior Vice President, In-Memory Platform, SAP

2.2.1 Establishing a Solid Database for In-Memory Computing

In-memory computing allows the processing of massive quantities of real-time data in the main memory of the server, providing immediate results from analyses and transactions. The SAP In-Memory Computing Database (Figure 2) delivers the following capabilities:

• In-memory computing functionality with native support for row and columnar data stores (discussed below), providing full ACID (atomicity, consistency, isolation, durability) transactional capabilities
• A calculation and planning engine with a data repository, providing persistent views of business information and a unified information modeling design environment
• A data management service that includes SQL and MDX interfaces, data integration capabilities for accessing SAP and non-SAP data sources, and integrated life cycle management

Figure 2. SAP In-Memory Computing supports complex calculations on massive amounts of data. [Diagram: the SAP In-Memory Computing Database, built on SAP In-Memory Computing technology, comprises a row store, a column store, a calculation and planning engine, and a data management service.]

When combined, these capabilities allow the SAP In-Memory Computing Database to help decision makers explore and analyze vast amounts of information at incredible response times and with a high degree of flexibility, without the need for IT involvement.

2.2.2 Delivering the SAP In-Memory Computing Database into the Mainstream

SAP first made SAP In-Memory Computing available in product form with the introduction of TREX (Text Retrieval and information EXtraction), a search engine component used as an integral part of various SAP software offerings. Building on the early success of products such as SAP NetWeaver Enterprise Search, the company introduced SAP BWA in 2005.

The company's ongoing focus on business intelligence (BI) solutions led it to integrate SAP In-Memory Computing into SAP BusinessObjects to create an accelerated BI solution called the SAP BusinessObjects Explorer, Accelerated Version. SAP In-Memory Computing has also been used to enable real-time analysis in a variety of other products, including the following:

• SAP Advanced Planner and Optimizer with liveCache
• SAP CRM Customer Segmentation
• SAP Business ByDesign™ Analytics
2.2.3 SAP HANA: The Next Generation of Real-Time Business Analysis

SAP In-Memory Computing has evolved to create SAP HANA, a multi-purpose appliance that combines SAP software components installed on servers based on the Intel Xeon processor. The integrated software stack is highly optimized for performance and reliability on the hardware, and it includes the SAP In-Memory Computing Database, real-time replication service, data modeling, and data services, as illustrated in Figure 3.

SAP HANA is engineered specifically to be data-agnostic, and the appliance is designed to be implemented without affecting existing applications or systems. The solution provides a flexible, cost-effective, real-time approach for managing large data volumes, allowing organizations to dramatically reduce the hardware and maintenance costs associated with running multiple data warehouses and analytical systems.

In the future, SAP HANA will act as the technology foundation for new, innovative business applications for planning, forecasting, operational performance, and simulation. The first business application running on SAP HANA is SAP BusinessObjects Strategic Workforce Planning, version for in-memory computing, a solution SAP developed that dramatically benefits from high-performance and real-time access to large volumes of transactional information.

SAP HANA benefits dramatically from the design features and benefits of the latest Intel Xeon processors, helping it deliver excellent value to end-customers. The performance, scalability, reliability, and energy efficiency of the hardware platform are particularly beneficial in this regard. High memory capacity and innovations in the memory and cache subsystems, as described above, support excellent results from the demands of in-memory computing. The sophisticated parallel software design, as described below, takes advantage of the platform's large-scale hardware parallelism.

Figure 3. SAP HANA provides data-agnostic integration with other business systems. [Diagram: the SAP In-Memory Computing Database — in-memory computing, calculation and planning engine, data management service, and admin and data modeling — exposes MDX, SQL, and BICS interfaces to SAP BusinessObjects and other applications, and connects through real-time replication services and data services to SAP NetWeaver® Business Warehouse, SAP Business Suite, and third-party data sources.]

The Intel and SAP HANA™ Value Proposition Across a Wide Variety of Business Units

SAP In-Memory Appliance Software (SAP HANA™) and the Intel® Xeon® processor enable a broad set of scenarios within various parts of the business, above and beyond transactional excellence and traditional analytics, such as the following:

• Order management. Available to promise, price optimization
• Supply chain. Transportation planning, inventory management, demand and supply planning
• Manufacturing. Production planning, performance management, asset utilization
• Sales and marketing. Analytics, customer segmentation, trade promotion management
• Corporate finance. Planning and budgeting, consolidation
• Human resources. Workforce analytics
• Information technology. Landscape optimization

3 Creating a Transition in Database Design

With the growth of data stores and business application complexity, solution providers have steadily developed novel approaches to optimizing performance. One key development has been the separation of online transactional processing (OLTP) and online analytical processing (OLAP) systems, because these two types of relational databases have somewhat divergent design goals. In the context of BI solutions, they can be considered as follows:

• OLTP systems instantly record business events as they happen, such as the sale of a piece of inventory. As such, system designs focus on quickly handling large numbers of small, simultaneous transactions.
• OLAP systems provide analysis of the data provided by OLTP systems to support business decisions. They are designed to handle a relatively small number of often-complex transactions.

In his paper, "A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database,"[1] SAP co-founder Hasso Plattner asserts the value of unifying both OLTP and OLAP in a single system based on column storage and in-memory computation. While the concept of combining the two systems is beyond the scope of this paper, Plattner's argument for the value of column storage and in-memory computing to both OLTP and OLAP has direct bearing on the capabilities of SAP HANA in those areas.
3.1 Performance Implications of Column versus Row Storage

Whereas relational databases typically use row-based storage, column-based storage may be more suitable in some cases. As shown in Figure 4, a database table is conceptually a two-dimensional structure made up of cells arranged in rows and columns. Because computer memory is structured linearly, there are two options for the sequences of cell values that are stored in contiguous memory locations:

• In row storage, the data sequence consists of the data fields in one table row.
• In column storage, the data sequence consists of the entries in one table column.

Figure 4 illustrates both options.

Figure 4. Unlike traditional databases, SAP HANA offers the choice between row-based and column-based data storage for each table.

   Country | Product | Sales
   US      | Alpha   | 3,000
   US      | Beta    | 1,250
   JP      | Alpha   |   700
   UK      | Alpha   |   450

   Row store:    US, Alpha, 3,000 | US, Beta, 1,250 | JP, Alpha, 700 | UK, Alpha, 450
   Column store: US, US, JP, UK | Alpha, Beta, Alpha, Alpha | 3,000, 1,250, 700, 450

Columnar storage may be more efficient in the common case where a given column includes only a relatively small number of distinct values. For example, if the table shown in Figure 4 includes a very large number of rows, one would expect that each entry in the Country column would nevertheless consist of one of perhaps a few dozen possible entries. That factor allows for very efficient data compression, which produces a number of advantages. For example, data compression allows for faster loading from system memory into the processor cache. Since that transport tends to be the main factor limiting system performance, speeding it up may be expected to produce a performance advantage, even considering the added overhead of decompressing the data. Another advantage of columnar storage is related to compression: computing the sum of values in a column is much faster if many additions of the same value can be replaced by a single multiplication. Similarly, holding values from a column together in cache is crucial for bulk processing using Intel SSE instructions.

Columnar storage also helps reduce the number of performance-limiting cache misses. Recognizing that data is transferred from main memory to the processor in blocks of fixed size called cache lines, consider the case of aggregating the sum of all of the values in the Sales column of a large version of the table in Figure 4. With row storage, only a minority of the data in any given cache line would be from the Sales column and therefore be relevant to the calculation. The large proportion of irrelevant data would result in cache misses, where the processor would have to discard the irrelevant data and wait for useful data in its place. Conversely, with columnar storage, every piece of data in each cache line would be relevant to the calculation, increasing efficiency by keeping the processor supplied with useful data.
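The two layouts can be sketched in a few lines of Python using the Figure 4 data. This is a toy illustration, not SAP HANA's storage engine; the cache-line argument is only suggested by which cells each scan has to step over:

```python
# Figure 4 table as (Country, Product, Sales) rows.
rows = [
    ("US", "Alpha", 3000),
    ("US", "Beta",  1250),
    ("JP", "Alpha",  700),
    ("UK", "Alpha",  450),
]

# Row store: contiguous memory holds whole rows, one after another.
row_store = [field for row in rows for field in row]

# Column store: contiguous memory holds whole columns, one after another.
countries, products, sales = (list(col) for col in zip(*rows))

# Summing Sales from the row store touches only every third cell; the
# skipped Country/Product cells would still fill each fetched cache line.
total_from_rows = sum(row_store[2::3])

# Summing from the column store scans one dense array: every value
# brought into cache is relevant to the calculation.
total_from_columns = sum(sales)

# Compression-style shortcut from the text: many additions of the same
# value can be replaced by a single multiplication per distinct value.
from collections import Counter
total_compressed = sum(value * count for value, count in Counter(sales).items())

assert total_from_rows == total_from_columns == total_compressed == 5400
```

On real hardware the three sums differ not in result but in memory traffic, which is the point of the cache-line discussion above.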
3.2 Dictionary Compression with Column Storage

Dictionary coding is a significant benefit when compressing column-stored data, as represented in Figure 5. Each distinct column value is assigned a sequential integer (beginning with zero) as a Value ID. An alphanumerically sorted dictionary stores all of the distinct column values, and the Value IDs are implicitly given by the sequential positions of each distinct column value. The actual column data is stored as a sequence of Value IDs, which are bit coded, meaning that they are stored using the minimum possible number of bits.

[Figure 5 shows the uncompressed "Name" column alongside its compressed form: a Value-ID sequence with one element per row, each pointing into a sorted dictionary of distinct values.]

Figure 5. Dictionary coding can enable significant compression advantages in column-stored data sets.

Dictionary compression of columnar data is most efficient when the number of distinct column values is relatively small compared to the size of the column (that is, when a relatively low number of Value IDs exists relative to the number of column values).

Because the dictionary is sorted, search operations and comparisons such as "greater than" and "less than" can be performed against the Value IDs to return results that are meaningful with regard to the actual column values. Because those operations are conducted on integer values, they can be much faster than, for example, performing them against long strings.

3.3 Efficiencies from Separation of Main and Delta Storage

As changes are committed into the database, they are stored in delta storage, which is distinct from main storage, and if new data contains column values that are not included in the main dictionary, they are stored in a delta dictionary. Because the delta dictionary is created without recoding the Value IDs in the main dictionary, there is no global order that encompasses the Value IDs in both the delta dictionary and the main dictionary.

The delta store supports "greater than" and "less than" queries by adding a CSB+-Tree to the delta dictionary. This tree holds the values of the delta store ordered as in the main store, although the amount of memory needed is higher than in the main-store case. To establish a single, sorted dictionary that includes all Value IDs once a delta dictionary has been established, the delta dictionary and main dictionary must be merged using a delta merge operation. To enable even higher compression of main storage, SAP HANA selects and implements the optimal compression method for the dictionary itself during delta merge.
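The dictionary-encoding scheme and the integer comparisons it enables can be sketched in a few lines of illustrative Python. The names follow the Figure 5 example; this is not SAP HANA's implementation, and the sorted positions here are simply Python's string ordering:

```python
from bisect import bisect_left

# Sketch (illustrative, not SAP HANA internals): dictionary encoding of
# a column. The sorted dictionary gives each distinct value a position,
# and the column stores only those integer Value IDs.
names = ["Miller", "Jones", "Millman", "Zsuwalski", "Baker",
         "Miller", "John", "Miller", "Johnson", "Jones"]

dictionary = sorted(set(names))            # sorted distinct values
value_id = {v: i for i, v in enumerate(dictionary)}
encoded = [value_id[v] for v in names]     # Value-ID sequence, one entry per row
# encoded -> [4, 3, 5, 6, 0, 4, 1, 4, 2, 3]

# Because the dictionary is sorted, a "greater than or equal" predicate on
# strings becomes an integer comparison on Value IDs.
threshold_id = bisect_left(dictionary, "Miller")
matches = [i for i, vid in enumerate(encoded) if vid >= threshold_id]
print(matches)  # row positions where Name >= "Miller": [0, 2, 3, 5, 7]
```

The scan in the last step compares small integers only; the string comparison happens once, against the dictionary, rather than once per row.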
3.4 Storage of and Queries against Historical Data

As an optional feature, SAP HANA can use temporal tables to query against a historical state of the data. Write operations on temporal tables do not physically overwrite existing records. Instead, write operations always insert new versions of the data record into the database. The different versions of a data record have time stamps indicating their validity, with current data being simply the most recent version.

Historical (non-current) data stored in a temporal table is still considered to be operational data, meaning that it is expected to be used in current operations, such as transactions or reporting. By contrast, future versions of SAP HANA plan to support a category specifically for data that is not expected to be used anymore, known as aged data, which no longer needs to be stored in memory. Instead, aged data will be eligible to be written to disk and released from memory.

Note that while aged data would not be accessed during normal operations, it would still be available if needed. To reduce access time, systems could use solid-state drives for this purpose. Alternatively, aged data could be written to flash memory, which can be accessed even more quickly.

For each individual type of data, application developers will need to create rules that govern when data should be considered "aged." For example, sales data might be considered aged once the order it is associated with has been closed for at least 90 days.

3.5 Boosting Performance by Moving Results Instead of Source Data

The data-intensive business applications in common use often move large, detailed data sets from the data layer to the application layer, which then performs operations on that data. This arrangement is not optimal in many cases, since the result of a calculation is typically smaller than the source data needed to derive it. Therefore, it is generally more efficient to perform calculations before moving the data and then move only the results.

To benefit from that scheme, SAP HANA incorporates a calculation engine into the data layer. The calculation engine can be manipulated programmatically by application developers. In this way, high-performance applications based on the SAP HANA architecture can delegate data-intensive operations to the data layer, achieving higher efficiency and performance than would otherwise be possible.

4 Collaboration by SAP and Intel for Performance Optimization and Parallelization

Co-innovation and collaboration between SAP and Intel is an important prerequisite for the success of the SAP HANA software components on Intel architecture-based hardware. The optimization draws both from industry best practices and from solution-specific engineering, drawing broadly on the expertise of performance engineers from both companies. As mentioned previously in this paper, these individuals have a long history of working together, and their work on SAP HANA represents the culmination of many years' experience.

This four-part section identifies some of the specific optimizations the team has made to the SAP HANA platform and the resulting performance benefits on Intel Xeon processor-based systems. The first three subsections describe the optimizations, organized for convenience by decreasing scale within the system:

• Among sockets. The Benefits from Multi-Socket Topology and NUMA subsection describes the collaboration around features and capabilities at the inter-processor level.

• Among cores. The Optimization for Multi-Core Processing subsection describes the collaboration around features and capabilities at the intra-processor, inter-core level.

• Within cores. The Optimization for Core-Level Features subsection describes the collaboration around features and capabilities at the intra-core level.

The fourth subsection, Performance Benefits from Optimization and Hardware Platform Features, presents some key performance results associated with the optimizations described in the previous subsections.

4.1 Benefits from Multi-Socket Topology and NUMA

Because SAP HANA runs on large-scale multi-socket systems, it benefits substantially from the NUMA (non-uniform memory access) shared-memory architecture. In multi-processor systems that are not NUMA capable (or where NUMA is not enabled), each processor is connected directly to all available system memory through the system bus. The access path between any combination of processor and memory is direct and electrically equivalent to any other combination. This arrangement is therefore known as "uniform memory access" (UMA), as shown in Figure 6a. Note that all the processors must contend for bandwidth on the system bus.
In NUMA systems (Figure 6b), each processor is connected directly to its own local memory, and all of the memory banks are connected together by the system bus. Therefore, any processor has direct, relatively high-speed access to its local memory, as well as access to all other memory through the system bus. Because bus-bandwidth contention occurs only on non-local memory accesses, there is a performance advantage to holding the data needed by a given processor in its own local memory; that placement is known as "processor affinity."

Figure 6. In NUMA systems, each processor has fast, direct access to its own local memory module, reducing the latency that arises due to bus-bandwidth contention.

In multi-threaded application software, the OS migrates threads between processor cores as needed to match workloads to available execution resources. Assuming that the data associated with a specific thread is already resident in memory, migrating that thread to a core in a different processor package carries with it a performance penalty. Therefore, NUMA-capable software must be able to work with the OS to manage the associated performance trade-offs. For example, the advantage of moving a thread to a processor package where more execution resources are available must be weighed against the disadvantage of moving its data across the system bus.

4.2 Optimization for Multi-Core Processing

To take advantage of the increasing core count in modern processors (as many as 10 cores supporting 20 threads in the Intel Xeon processor E7 family), SAP HANA is engineered for parallel execution that scales well with the number of available cores. Specifically, optimization for multi-core platforms accounts for the following two key considerations:

• Data is partitioned wherever possible into sections that allow calculations to be executed in parallel.

• Sequential processing is avoided, which includes finding alternatives to approaches such as thread locking.

4.2.1 Parallel Aggregation

In the shared-memory architecture within a node, SAP HANA performs aggregation operations by spawning a number of threads that act in parallel, each of which has equal access to the data resident in the memory on that node. As illustrated at the top of Figure 7, each aggregation thread functions in a loop-wise fashion as follows:

1. Fetch a small partition of the input relation.
2. Aggregate that partition.
3. Return to step 1 until the complete input relation is processed.
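The fetch-aggregate-merge loop above can be sketched with ordinary threads and hash tables. This is a simplified illustration, with the chunk size and thread count chosen arbitrarily; SAP HANA's actual implementation is native code and NUMA-aware:

```python
import threading
from collections import Counter

# Sketch of the parallel-aggregation pattern in Figure 7 (simplified,
# illustrative): worker threads repeatedly claim small partitions of the
# input, aggregate them into private hash tables, and the partial
# results are merged at the end.

table = [("Alpha", 700), ("Beta", 1250), ("Alpha", 3000), ("Alpha", 450)] * 1000
CHUNK = 64
cursor = 0
cursor_lock = threading.Lock()
local_tables = []

def worker():
    global cursor
    local = Counter()                     # private hash table, no sharing
    while True:
        with cursor_lock:                 # only the chunk cursor is contended
            start, cursor = cursor, cursor + CHUNK
        chunk = table[start:start + CHUNK]
        if not chunk:
            break
        for key, value in chunk:          # aggregate this small partition
            local[key] += value
    local_tables.append(local)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# Merge step: in SAP HANA, merger threads each own a key range; here a
# single merge of the private tables suffices.
result = sum(local_tables, Counter())
print(dict(result))
```

Handing out small chunks through a shared cursor is what balances the load: a thread that gets an expensive chunk simply claims fewer chunks overall.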
[Figure 7 shows aggregation threads reading chunks of the input table through a buffer into cache-sized local hash tables, and merger threads combining them into part hash tables.]

Figure 7. SAP HANA employs parallel aggregation to divide among threads the work of fetching, aggregating, and merging data.

Note that the approach of using small chunks and dynamically assigning them to threads distributes the computation costs over all threads relatively evenly, even if requirements differ substantially among the chunks.

Each thread has a private hash table where it writes its aggregation results. When the aggregation threads are finished, the buffered hash tables are merged by merger threads using range partitioning. Each merger thread is responsible for a certain range, which is indicated in Figure 7 by a different shade of grey. The merger threads aggregate their partitions into part hash tables, and the final result is obtained by concatenating the part hash tables.

4.2.2 Processor Cache Optimizations

Taking best advantage of processor cache resources is another aspect of performance optimization that plays an important role in the development of SAP HANA. Because accessing data from cache is so much faster than accessing data in main memory, special care must be taken to improve the odds that the processor can access the data it needs at any given time from cache, rather than being forced to access memory.

As described above in the Performance Implications of Column versus Row Storage section, determining what data will be fetched together in a given cache line is an important aspect of optimizing performance within the memory/cache subsystem. Block-wise decompression of data allows decompressed data to reside in L1 cache for access by the processor.

The columnar design results in consecutive access of large amounts of data, which is an excellent fit for the hardware prefetchers built into Intel Xeon processors. The SAP HANA development team added software prefetching hints to improve performance in the limited number of circumstances where data was not automatically prefetched (for example, when crossing page boundaries). Similarly, note that the hash tables to which worker threads write aggregation results (Figure 7) are sized to the cache, helping to reduce the incidence of cache misses in handling that data.
4.3 Optimization for Core-Level Features

In addition to the advantages available to SAP HANA at the levels of interactions among processors and interactions among cores, a number of features are also relevant at the level of operation within individual cores. Some of those advantages and considerations are discussed in this section.

4.3.1 Optimization for Intel® HT Technology

Intel HT Technology helps improve performance through increased parallelism by enabling each processor core to accommodate two threads with independent instruction streams. While the two threads necessarily share execution resources within the core, the presence of two threads can increase utilization of the available execution units. This effect typically increases the number of instructions executed in a given amount of time within a core, as shown in Figure 8.

This greater efficiency enables higher throughput (since more work gets completed per clock cycle) and higher performance per watt (since fewer idle execution units consume power without contributing to performance). In addition, when one thread has a cache miss, branch mispredict, or any other pipeline stall, the other thread continues processing instructions at nearly the same rate as a single thread running on the core.

Figure 8. Intel® Hyper-Threading Technology (Intel® HT Technology) gives the processor access to two threads in the same time slice, reducing the level of idle hardware resources.

In large part, achieving the optimal performance benefit from Intel HT Technology is simply an extension of the optimization that must be done to take advantage of hardware parallelism generally. An in-depth discussion of issues specific to Intel HT Technology is beyond the scope of this paper. For more information, see the paper "Performance Insights to Intel® Hyper-Threading Technology."2 To help ensure optimal benefit from Intel HT Technology, the development team addressed the following considerations:

• Parallel scaling. Intel HT Technology effectively doubles the number of software threads required to take full advantage of the execution hardware, making SAP HANA's ability to identify and spawn the correct number of threads even more important.

• Thread balance. Obtaining the optimal performance benefit from Intel HT Technology requires the work to be divided as evenly as possible among software threads. One approach taken with SAP HANA is to assign work in very small chunks, as described above in the discussion of parallel aggregation.

• Avoiding threading bottlenecks. The increased hardware parallelism associated with Intel HT Technology carries with it an increased danger from threading bottlenecks, so the team refined SAP HANA's threading model to avoid issues such as false sharing and excessive locks and synchronization.

4.3.2 Standards Compliance and Accuracy for Decimal Floating-Point Math

The IEEE Standard 754-2008 for Binary Floating-Point Arithmetic defines functions that overcome important limitations in previous versions, most recently Standard 754-1985. Specifically, the standard describes multiple encoding methods for the numbers used in calculations. Under previous versions of the standard, certain applications for use in regulated industries such as finance were forced to use more resource-intensive encoding to meet accuracy requirements. The accuracy issues were related to complexities in representing decimals precisely in binary format. For example, the following calculation, performed in single precision, yields the result 6.9999997504 rather than 7.00:3

(7.00 / 10000.0) * 10000.0

The Intel® Decimal Floating-Point Math Library provides an efficient implementation of binary integer decimal encoding that complies with Standard 754-2008, which enhances the suitability of SAP HANA for use in the full spectrum of industry settings.
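Python's standard decimal module can illustrate the same distinction between binary and decimal encodings. This is only an analogy; SAP HANA uses the Intel Decimal Floating-Point Math Library, not Python:

```python
from decimal import Decimal

# Sketch: why decimal encodings matter for financial math. Binary
# floating point cannot represent most decimal fractions exactly, so
# round-trip calculations can drift; a decimal encoding keeps them exact.

# Classic binary-float artifact, here in double precision:
binary = 0.1 + 0.2
print(binary)            # 0.30000000000000004, not 0.3

# The paper's example, computed exactly with decimal arithmetic:
exact = Decimal("7.00") / Decimal("10000.0") * Decimal("10000.0")
print(exact)             # exactly 7, displayed as 7.00000
```

The decimal result equals 7 exactly because 0.0007 is representable in a base-10 encoding, while no binary fraction represents it exactly.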
4.3.3 Benefits from Intel® Turbo Boost Technology

Intel® Turbo Boost Technology4 automatically allows processor cores to run faster than the base operating frequency when the OS requests the highest processor performance state (P0), provided the processor is operating below its power, current, and temperature specification limits. The extent of this capability also depends on the number of active cores in the processor package, as illustrated in Figure 9.

Figure 9. Intel® Turbo Boost Technology allows individual cores to operate briefly above the thermal design point, under certain conditions.

Figure 10 summarizes the successive generations of Intel® instruction sets:

• Intel SSE (1999), 70 instructions: single-precision vectors; streaming operations
• Intel SSE2 (2000), 144 instructions: double-precision vectors; 128-bit vector integer
• Intel SSE3 (2004), 13 instructions: complex arithmetic
• Intel SSSE3 (2006), 32 instructions: decode operations
• Intel SSE4.1 (2007), 47 instructions: video accelerators; graphics; co-processor accelerators
• Intel SSE4.2 (2008), 7 instructions: string manipulation; cyclic redundancy check (CRC32)
• Intel AES-NI (2010), 6 instructions: encryption and decryption accelerators

Figure 10. Successive generations of instructions, including Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI), accelerate various functions on Intel® architecture.

4.3.4 Optimization of Bit Compression with Intel® SSE-Based Vectorization

Intel periodically introduces sets of instructions that extend platform capabilities with certain types of operations, as illustrated in Figure 10. The value of these instructions is primarily to accelerate those operations and secondarily to reduce the complexity of the implementation.

Intel SSE instructions provide the ability to support vector operations on 128-bit wide registers. Theoretically, these operations should result in a 4x speed-up on 32-bit data types and a 2x speed-up on 64-bit data types. SAP HANA uses this advantage for decompression and search on its base data structures. In cooperation with Intel, SAP engineers have developed a highly optimized set of algorithms. By default, SAP HANA checks on startup to determine which sets of Intel SSE instructions are supported by the processor and chooses the implementation that is best suited to maximize decompression and search performance.

4.4 Performance Benefits from Optimization and Hardware Platform Features

This section covers the performance increases enabled by the optimization work on SAP HANA for the Intel Xeon processor E7 family conducted jointly by SAP and Intel. It also describes the performance benefits that are available from some key hardware-platform features.

4.4.1 Performance Comparison between Processor Generations and Impact of Hardware Features

To demonstrate the performance benefits of implementing SAP HANA on the latest Intel Xeon processors, a test lab at Intel compared results against a reference workload between the Intel® Xeon® processor 7500 series and the Intel Xeon processor E7 family, as summarized in Figure 11.5 The workload consisted of a single-user scenario that was provided by an SAP customer and used as the internal reference scenario for SAP. It processed 3.2 billion records of point-of-sale data in a complex fashion.
Figure 11. Testing demonstrated a 1.37x performance advantage by the Intel® Xeon® processor E7 family over the Intel® Xeon® processor 7500 series and benefits from key platform features.5

The testing summarized in Figure 11 consisted of two major aspects: testing the performance differential between the two processor generations and testing the distinct performance impacts of individual hardware-platform features (that is, Intel HT Technology, NUMA, and Intel Turbo Boost Technology). The testing approach consisted of two primary parts:

• Performance impact of processor generation. A direct performance comparison was made between the two processor generations under test, with all three of the platform features enabled. This result was intended to show the effect of the processor generation in isolation from other aspects of the platform.

• Performance impact of platform features. Each of the three platform features was disabled on the Intel Xeon processor E7 family-based system, one at a time with the other two features enabled, and each of those results was compared to the case in which all three features were enabled. This result was used to show the isolated effect of each platform feature.

As reported in Figure 11, the impact of the processor generation alone was that the Intel Xeon processor E7 family delivered a 1.37x performance increase relative to the Intel Xeon processor 7500 series.5 On the system based on the Intel Xeon processor E7 family, Intel HT Technology produced a 1.21x performance increase,5 NUMA produced a 1.07x performance increase,5 and Intel Turbo Boost Technology produced a 1.04x performance increase.5 Notably, these performance increases were measured under relatively high levels of processor utilization, demonstrating their value in adding to system-level performance headroom.

4.4.2 Performance Benefits of Intel® SSE Vectorization for Search

SAP HANA ensures flexibility and makes the most of installed system memory by compressing all base data structures using a packed bitfields approach for each of the bit cases 1 to 32 bits. (For a discussion of compression in the SAP In-Memory Computing Database, see the Creating a Transition in Database Design section.) Using the packed bitfields approach enables the system to store integer values more efficiently than as standard STL vectors.
Instead of using 32 bits per integer, SAP HANA uses only as many bits as are needed to represent the highest distinct value in a vector as a bit string. A separate implementation is required for each bit case, and averaged over all bit cases, using Intel SSE with TREX accelerates search on the compressed data by a factor of 1.6x relative to scalar code on the Intel® Xeon® processor 5500 series, as shown in Figure 12.6

[Figure 12 plots the speed-up of Intel SSE versus scalar code for each bit case from 1 to 31.]

Figure 12. On average over all bit cases, Intel® Streaming SIMD Extensions (Intel® SSE) accelerates search on compressed data using SAP In-Memory Computing Database technology by a factor of 1.6x on the Intel® Xeon® processor 5500 series.6

4.4.3 Performance Scaling with Increasing Core Count

To check performance scalability relative to increases in the number of processor cores, the team began by calculating the "ideal" case, in which the number of queries per second would increase linearly on a one-to-one basis with processor frequency (represented by the upper, dashed line in Figure 13). Next, by adjusting the core frequency using BIOS settings, they measured the actual increase (represented by the lower, solid line).

The results show that performance increases relative to increases in processor frequency at an efficiency of 91 to 94 percent.7 While these results were obtained using SAP Business Warehouse Accelerator, a predecessor technology to SAP HANA, note that both software environments are based on similar in-memory computing technology.

Figure 13. SAP Business Warehouse Accelerator workloads scale almost at the ideal rate as processor core frequency increases.7
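The packed-bitfields idea described in Section 4.4.2 can be sketched as follows. This is an illustration of the concept only, not SAP HANA's SSE implementation; a real kernel would operate on 128-bit registers with one code path per bit case:

```python
# Sketch (illustrative, not SAP HANA's implementation): store each
# Value ID using only as many bits as the largest value requires,
# instead of a full 32-bit word.

def pack(values):
    """Pack non-negative ints using the minimum bits for the largest one."""
    bits = max(values).bit_length() or 1
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * bits)           # place each value in its fixed-width slot
    return word, bits

def unpack(word, bits, count):
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

value_ids = [4, 3, 5, 6, 0, 4, 1, 4, 2, 3]      # all fit in 3 bits
word, bits = pack(value_ids)
print(bits)                                      # 3 -> 30 bits total vs. 320 at 32 bits each
print(unpack(word, bits, len(value_ids)))        # round-trips to the original list
```

Because every value occupies a fixed number of bits, a search kernel can decode and compare many values per register load, which is exactly what the SSE-vectorized bit cases exploit.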
5 Enhancing Reliability for In-Memory Computing Systems

5.1 In-Memory Recovery

SAP HANA includes software capabilities that take advantage of the hardware features of Intel Xeon processors to dramatically reduce the incidence of system failures. In addition, the SAP HANA appliance can fail over from one compute node to another in the event of a processor failure and can handle memory errors with as little impact on workloads as possible.

A growing scope of RAS capabilities has been introduced in the Intel Xeon processor, reaching a suitability for business-critical implementations that was previously associated only with more expensive RISC and other proprietary solutions. For example, Machine Check Architecture Recovery enables SAP HANA to work with the OS to recover from many otherwise-uncorrectable memory errors, recovering data without downtime in many cases.

5.2 ACID Properties

SAP HANA has been designed and refined to provide excellent compliance with the so-called ACID set of properties designed to ensure reliable processing by a database system. In general, those properties are as follows:

• Atomicity is the property of a database transaction that guarantees that if the entire transaction cannot be completed, no change is made to the database, meaning that "partial" transactions will not take place. Atomicity is intimately related to the reliability features of the platform described above, in the sense that those characteristics help prevent the system from failing in mid-transaction or help it recover if a failure does occur. In addition to those capabilities, SAP HANA includes programmatic safeguards to ensure Atomicity.

• Consistency refers to the assurance that all transactions performed by the database will take it from one consistent state to another, which includes the requirement that only valid data will be written to the database. In the event that invalid data is written to the database in error, the Consistency property requires that the database can be rolled back to a last known good state. This capability is provided for by the presence of the temporal tables described previously in this paper, and it will be enhanced by the future implementation of the aged-data features also discussed.

• Isolation is the requirement that all data associated with a transaction be inaccessible by any other operation for the duration of the transaction. This property is necessary to prevent multiple writes to a single data location during transactions, which could lead to data inconsistency and unpredictable results. Because Isolation requires data to be locked while it is in use for a transaction, SAP HANA uses sophisticated algorithms to minimize the performance impact of that locking.

• Durability means that once a transaction has been committed to the database, it can always be recovered in the event of a subsequent hardware or software failure. This property prevents completed transactions from having to be reversed. Because in-memory databases use volatile memory as their main data store, SAP HANA relies on the use of flash memory to provide Durability by retaining a combination of snapshots and logs that together assure the recoverability of committed data in the event of a system failure.

6 Application Development and Tools for In-Memory Computing

SAP HANA provides a powerful application-development environment specifically tailored to the needs of organizations in building business applications. It also provides SQL Script, a language for processing application-specific code in the database layer, which has the following two advantages:

• Transfer efficiency. Moving calculations to the database layer eliminates the need to transfer large amounts of data from the database to the application.

• Feature utilization. Calculations need to be executed in the database layer to get the maximum benefit from features such as fast column operations, query optimization, and parallel execution.
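The transfer-efficiency advantage can be sketched with a hypothetical data-layer class. This is illustrative Python only; SQL Script and the SAP HANA calculation engine are the real mechanisms:

```python
# Sketch (hypothetical class names, not a SAP HANA API): pushing a
# calculation into the data layer so that only the small result crosses
# the layer boundary, instead of the full detailed data set.

class DataLayer:
    """Holds the detailed records and can run calculations in place."""

    def __init__(self, sales):
        self.sales = sales                # large, detailed data set

    def fetch_all(self):
        # Naive pattern: ship every record to the application layer.
        return list(self.sales)

    def total(self):
        # Push-down pattern: compute in the data layer, return one number.
        return sum(self.sales)

data = DataLayer(sales=list(range(1_000_000)))

# Application-layer aggregation moves 1,000,000 values across the boundary...
app_side = sum(data.fetch_all())

# ...while delegating the calculation moves exactly one.
db_side = data.total()

print(app_side == db_side)  # same answer, very different transfer cost
```

The result is identical either way; what changes is the volume of data crossing between the database and the application, which is the bottleneck the push-down approach removes.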