Performance Analysis of TLS Web Servers



Cristian Coarfa, Peter Druschel and Dan S. Wallach
Department of Computer Science, Rice University

Abstract

TLS is the protocol of choice for securing today's e-commerce and online transactions, but adding TLS to a web server imposes a significant overhead relative to an insecure web server on the same platform. We perform a comprehensive study of the performance costs of TLS. Our methodology is to profile TLS web servers with trace-driven workloads, replacing individual components inside TLS with no-ops, and measuring the observed increase in server throughput. We estimate the relative costs of each component within TLS, predicting the areas for which future optimizations would be worthwhile. Our results show that RSA accelerators are effective for e-commerce site workloads, because these workloads experience low TLS session reuse. Accelerators appear to be less effective for sites where all the requests are handled by a TLS server, which have a higher session reuse rate; there, investing in a faster CPU might prove more effective.

1. Introduction

Secure communication is an intrinsic demand of today's world of online transactions. The most widely used method is SSL/TLS [10]. Originally designed at Netscape for its web browsers and servers, Netscape's Secure Socket Layer (SSL) has been standardized by the IETF and is now called Transport Layer Security (TLS). TLS runs at the transport layer above existing protocols like TCP. TLS is used in a variety of applications, including secure web servers, secure shell and secure mail servers. As TLS is most commonly used for secure web applications, such as online banking and e-commerce, our goal is to provide a comprehensive performance analysis of TLS web servers. While previous attempts to understand TLS performance have focused on specific processing stages, such as the RSA operations or the session cache, we analyze TLS web servers as systems, measuring page-serving throughput under trace-driven workloads.

TLS provides a flexible architecture that supports a number of different public key ciphers, bulk encryption ciphers, and message integrity functions. In its most common web usage, TLS uses 1024-bit RSA encryption to transmit a secret that serves to initialize a 128-bit RC4 stream cipher and uses MD5 as a keyed hash function. (Details of these algorithms can be found in Schneier [25] and most other introductory cryptography texts.)

TLS web servers incur a significant performance penalty relative to a regular web server running on the same platform (from as little as a factor of 3.4 to as much as a factor of 9, in our own experiments). As a result of this cost, a number of hardware accelerators are offered by vendors such as nCipher, Broadcom, Alteon and Compaq's Atalla division. These accelerators take the modular exponentiation operations of RSA and perform them in custom hardware, thus freeing the CPU for other tasks.

Researchers have also studied algorithms and systems to accelerate RSA operations. Boneh and Shacham [8] have designed a software system to perform RSA operations together in batches, at a lower cost than doing the operations individually. Dean et al. [9] have designed a network service that offloads the RSA computations from web servers to dedicated servers with RSA hardware. A more global approach is to distribute the TLS processing stages among multiple machines. Mraz [16] has designed an architecture for high-volume TLS Internet servers that offloads the RSA processing and bulk ciphering to dedicated servers.

The TLS designers knew that RSA was expensive and that web browsers tend to reconnect many times to the same web server. To address this, they added a session cache, allowing subsequent connections to resume an earlier TLS session and thus reuse the result of an earlier RSA computation. Research has suggested that, indeed, session caching helps web server performance [11].

Likewise, there has been considerable prior work in performance analysis and benchmarking of conventional web servers [15, 12, 17, 5, 18], performance optimizations of web servers, performance-oriented web server design, and operating system support for web servers [13, 22, 6, 7, 21]. Apostolopoulos et al. [3] studied the cost of TLS connection setup, RC4 and MD5, and proposed changes to the TLS connection setup protocol.
Our methodology is to replace each individual operation within TLS with a "no-op" and measure the incremental improvement in server throughput. This methodology measures the upper bound that may be achieved by optimizing each operation within TLS, whether through hardware or software acceleration techniques. We can measure the upper bound on a wide variety of possible optimizations, including radical changes like reducing the number of TLS protocol messages. Creating such an optimized protocol and proving it to be secure would be a significant effort, whereas our simulations let us rapidly measure an upper bound on the achievable performance benefit. If the benefit were minimal, we would then see no need for designing such a protocol.

Section 2 presents an overview of the TLS protocol. Section 3 explains how we performed our experiments and what we measured. Section 4 analyzes our measurements in detail. Our paper wraps up with future work and conclusions.

2. TLS protocol overview

The TLS protocol, which encompasses everything from authentication and key management to encryption and integrity checking, fundamentally has two phases of operation: connection setup and steady-state communication.

Connection setup is quite complex. Readers looking for complete details are encouraged to read the RFC [10]. The setup protocol must, among other things, be strong against active attackers trying to corrupt the initial negotiation where the two sides agree on key material. Likewise, it must prevent "replay attacks" where an adversary who recorded a previous communication (perhaps one indicating some money is to be transferred) could play it back without the server's realizing the transaction is no longer fresh (thus allowing the attacker to empty out the victim's bank account).

TLS connection setup has the following steps (quoting from the RFC):

  • Exchange hello messages to agree on algorithms, exchange random values, and check for session resumption.

  • Exchange the necessary cryptographic parameters to allow the client and server to agree on a "premaster secret".

  • Exchange certificates and cryptographic information to allow the client and server to authenticate themselves. [In our experiments, we do not use client certificates.]

  • Generate a "master secret" from the premaster secret chosen by the client and exchanged random values.

  • Allow the client and server to verify that their peer has calculated the same security parameters and that the handshake occurred without tampering by an attacker.

There are several important points here. First, the TLS protocol designers were aware that performing the full setup protocol is quite expensive, requiring two network round-trips (four messages) as well as expensive cryptographic operations, such as the 1024-bit modular exponentiation required of RSA. For this reason, the premaster secret can be stored by both sides in a session cache. When a client subsequently reconnects, it need only present a session identifier. Then, the premaster secret (known to client and server but not to any eavesdropper) can be used to create a new master secret, a connection-specific value from which the connection's encryption keys, message authentication keys, and initialization vectors are derived.

After the setup protocol is completed, the data exchange phase begins. Prior to transmission, the data is broken into packets. Each packet is optionally compressed, and a keyed message authentication code is computed and added to the message along with its sequence number. Finally, the packet is encrypted and transmitted. TLS also allows for a number of control messages to be transmitted.

Analyzing the above information, we see a number of operations that may form potential performance bottlenecks. Performance can be affected by the CPU costs of the RSA operations and the effectiveness of the session cache. It can also be affected by the network latency of transmitting the extra connection setup messages, as well as the CPU latency of marshaling, encrypting, decrypting, unmarshaling, and verifying packets. This paper aims to quantify these costs.

3. Methodology

We chose not to perform "micro-benchmarks" such as measuring the CPU time necessary to perform specific operations. In a system as complex as a web server, I/O and computation happen simultaneously and the system's bottleneck is never intuitively obvious. Instead, we chose to measure the throughput of the web server under various conditions. To measure the costs of individual operations, we replaced them with no-ops. Replacing cryptographically significant operations with no-ops is obviously insecure, but it allows us to measure an upper bound on the performance that would result from optimizing the system. In effect, we simulate ideal hardware accelerators. Based on these numbers, we can estimate the relative cost of each operation using Amdahl's Law (see Section 4).
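This Amdahl's-Law style estimate can be sketched in a few lines. The helper below is purely illustrative (ours, not part of the paper's measurement harness), and assumes server throughput is inversely proportional to per-request processing time:

```python
def component_cost_fraction(baseline_tput, noop_tput):
    """Estimate the fraction of per-request processing time spent in one
    component, given server throughput (requests/sec) measured with the
    component enabled (baseline_tput) and with it replaced by a no-op
    (noop_tput).

    Per-request time is 1/throughput, so the component's share of the
    baseline time is (1/baseline - 1/noop) / (1/baseline), which
    simplifies to 1 - baseline/noop.
    """
    if noop_tput < baseline_tput:
        raise ValueError("a no-op replacement should not lower throughput")
    return 1.0 - baseline_tput / noop_tput
```

For example, if replacing a component with a no-op doubles throughput, that component accounted for roughly half of the original per-request processing time.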
3.1. Platform

Our experiments used two different hardware platforms for the TLS web servers: a generic 500MHz Pentium III clone and a Compaq DL360 server with a single 933MHz Pentium III. Both machines had 1GB of RAM and a gigabit Ethernet interface. Some experiments also included a Compaq AXL300 [4] cryptography acceleration board. Three generic 800MHz Athlon PCs with gigabit Ethernet cards served as TLS web clients, and all experiments were performed using a private gigabit Ethernet switch.

All computers ran RedHat Linux 6.2. The standard web server used was Apache 1.3.14 [2], and the TLS web server was Apache with mod_SSL 2.7.1-1.3.14 [14]. We chose the Apache mod_SSL solution due to its wide availability and use, as shown by a March 2001 survey [26]. The TLS implementation used by mod_SSL in our experiments is the open-source OpenSSL 0.9.5a [19]. The HTTPS traffic load was generated using the methodology of Banga et al. [5], with additional support for OpenSSL. As we are interested primarily in studying the CPU performance bottlenecks arising from the use of cryptographic protocols, we needed to guarantee that other potential bottlenecks, such as disk or network throughput, did not cloud our throughput measurements. To address this, we used significantly more RAM in each computer than its working set size, thus minimizing disk I/O once the disk caches are warm. Likewise, to avoid network contention, we used gigabit Ethernet, which provides more bandwidth than the computers in our study can reasonably generate.

3.2. Experiments performed

We performed four sets of experiments, using two different workload traces against two different machine configurations. One workload simulated the secure servers at Amazon. We purchased two books at Amazon, one as a new user and one as a returning user. By replicating the corresponding HTTPS requests in the proportions that they are experienced by Amazon, we can simulate the load that a genuine Amazon secure server might experience. Our other workload was a 100,000-hit trace taken from our departmental web server, using a 530MB set of files. While our departmental web server supports only normal, unencrypted web service, we measured the throughput for running this trace under TLS to determine the costs that would be incurred if our normal web server were replaced with a TLS web server.

These two workloads represent endpoints of the workload spectrum that TLS-secured web servers might experience. The Amazon workload has a small average file size, 7KB, while the CS trace has a large average file size, 46KB. Likewise, the working size of the CS trace is 530MB while the Amazon trace's working size is only 279KB. Even with the data stored in RAM buffers, these two configurations provide quite different stresses upon the system. For example, the Amazon trace will likely be stored in the CPU's cache whereas the CS trace will generate more memory traffic. The Amazon trace thus places similar pressure on the memory system as we might expect from dynamically generated HTML (minus the costs of actually fetching the data from an external database server). Likewise, the CS trace may put more stress on the bulk ciphers, with its larger files, whereas the Amazon trace puts more pressure on the connection setup costs, as its connections are, on average, much shorter lived.

In addition to replacing cryptographic operations, such as RSA, RC4, MD5/SHA-1, and secure pseudo-random number generation, with no-ops¹, we also investigated replacing the session cache with an idealized "perfect cache" that returns the same session every time (thus avoiding contention costs in the shared memory cache). Simplifying further, we created a "skeleton TLS" protocol in which all TLS operations are completely removed but messages of the same length as the TLS handshake are still transmitted. This simulates an "infinitely fast" CPU that still needs to perform all the same network operations. Finally, we hypothesize a faster TLS session resumption protocol that removes two messages (one network round-trip), and measure its performance.

Through each of these changes, we can progressively simulate the effects of "perfect" optimizations, identifying an upper bound on the benefits available from optimizing each component of the TLS system.

¹ While TLS also supports operating modes which use no encryption (e.g., TLS_NULL_WITH_NULL_NULL), our no-op replacements still use the original data structures, even if their values are now all zeros. This results in a more accurate simulation of "perfect" acceleration.
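The way these substitutions bound the achievable speedups can be illustrated with a toy cost model (the stage names and costs below are entirely hypothetical, not measurements from our experiments): per-request processing is treated as a sum of stage costs, a no-op replacement removes a stage's cost, and the simulated throughput upper bound is the reciprocal of what remains.

```python
# Toy model of the no-op substitution methodology: per-request processing is
# a pipeline of named stages; replacing a stage with a no-op removes its
# cost, and the throughput upper bound is 1 / (remaining cost per request).
def simulated_throughput(stage_costs, noop_stages=()):
    """stage_costs: dict mapping stage name -> cost per request (seconds).
    noop_stages: collection of stages replaced by no-ops (cost becomes 0)."""
    remaining = sum(cost for stage, cost in stage_costs.items()
                    if stage not in noop_stages)
    return 1.0 / remaining

# Hypothetical illustrative costs, not figures from this paper:
stages = {"rsa": 0.004, "rc4": 0.001, "mac": 0.0005, "other": 0.0015}
base = simulated_throughput(stages)             # all stages enabled
no_rsa = simulated_throughput(stages, {"rsa"})  # ideal RSA accelerator
```

Successively adding stages to `noop_stages` mirrors the progression from individual no-op replacements down to the "skeleton TLS" configuration, where only the non-removable work remains.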
3.2.1. Amazon-like workload experiments

We were interested in closely simulating the load that might be experienced by a popular e-commerce site, such as Amazon. While our experiments do not include the database back-end processing that occurs in e-commerce sites, we can still accurately model the front-end web server load.

Normally, an Amazon customer selects goods to be purchased via a normal web server, and only interacts with a secure web server when submitting credit card information and verifying purchase details. To capture an appropriate trace, we configured a Squid proxy server and logged the data as we purchased two books from Amazon, one as a new customer and one as a returning customer. The web traffic to browse Amazon's inventory and select the books for purchase occurs over a regular web server; only the final payment and shipping portion occurs with a secure web server. Of course, the traces we recorded do not contain any plaintext from the secure web traffic, but they do indicate the number of requests made and the size of the objects transmitted by Amazon to the browser. This is sufficient information to synthesize a workload comparable to what Amazon's secure web servers might experience. The only value we could not directly measure is the ratio of new to returning Amazon customers. Luckily, Amazon provided this ratio (78% returning customers to 22% new customers) in a recent quarterly report [1]. For our experiments, we assume that returning customers do not retain TLS session state, and will thus complete the full TLS handshake every time they wish to make a purchase. In this scenario, based on our traces, the server must perform a full TLS handshake approximately once out of every twelve web requests. This one-full-handshake-per-purchase assumption may cause us to overstate the relative costs of performing full TLS handshakes, but it does represent a "worst case" that could well occur in e-commerce workloads.

We created files on disk to match the sizes collected in our trace and request those files in the order they appear in the trace. When replaying the traces, each client process uses at most four simultaneous web connections, just as common web browsers do. We also group together the hits corresponding to each complete web page (HTML files and inline images) and do not begin issuing requests for the subsequent page until the current page is completely loaded. All three client machines run 24 of these processes each, causing the server to experience a load comparable to 72 web clients making simultaneous connections.

3.2.2. CS workload experiments

We also wished to measure the performance impact of replacing our departmental web server with a TLS web server. To do this, we needed to design a system to read a trace taken from the original server and adapt it to our trace-driven TLS web client. Because we are interested in measuring maximum server throughput, we discarded the timestamps in the trace and instead replayed requests from the trace as fast as possible. However, we needed to determine which requests in the original trace would have required a full TLS handshake and which requests would have reused the sessions established by those TLS handshakes. To do this, we assumed that all requests in the trace that originated at the same IP address corresponded to one web browser. The first request from a given IP address must perform a full TLS handshake. Subsequent requests from that address could reuse the previously acquired TLS session. This assumption is clearly false for large proxy servers that aggregate traffic for many users. For example, all requests from America Online users appear to originate from a small number of proxies. To avoid an incorrect estimation of the session reuse, we hand-deleted all known proxy servers from our traces. The remaining requests could then be assumed to correspond to individual users' web browsers. The final trace contained approximately 11,000 sessions spread over 100,000 requests.

In our trace playback system, three client machines ran 20 processes each, generating 60 simultaneous connections, which proved sufficient to saturate the server. The complexity of the playback system lies in its attempt to preserve the original ordering of the web requests seen in the original trace. Apache's logging mechanism actually records the order in which requests complete, not the order in which they were received. As such, we have insufficient information to faithfully replay the original trace in its original order. Instead, we derive a partial ordering from the trace. All requests from a given IP address are totally ordered, but requests from unrelated IP addresses have no ordering. This allows the system to dispatch requests in a variety of different orders, but preserves the behavior of individual traces.

As a second constraint, we wished to enforce an upper bound on how far the requests observed by the web server may differ from the order of requests in the original trace. If this bound were too small, it would artificially limit the concurrency that the trace playback system could exploit. If the bound were too large, there would be less assurance that the request ordering observed by the experimental server accurately reflected the original behavior captured in the trace. In practice, we needed to set this boundary at approximately 10% of the length of the original trace. Tighter boundaries created situations where the server was no longer saturated, and the clients could begin no new requests until some older large request, perhaps for a very large file, could complete.

While this technique does not model the four simultaneous connections performed by modern web browsers, it does saturate the server sufficiently that we believe the server throughput numbers would not change appreciably.

4. Analysis of experimental results

Figures 1 and 2 show the main results of our experiments with the Amazon trace and the CS trace, respectively. The achieved throughput is shown on the y-axis. For each system configuration labeled along the x-axis, we show two bars, corresponding to the results obtained with the 500MHz system and the 933MHz system, respectively.

Three clusters of bar graphs are shown along the x-axis. The left cluster shows three configurations of a complete, functional web server: the Apache HTTP web server (Apache), the Apache TLS web server (Apache+TLS), and the Apache TLS web server with the AXL300 accelerator (Apache+TLS AXL300).
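The per-IP-address session-reuse heuristic described in Section 3.2.2 can be sketched as follows. This is a simplified illustration, not the paper's trace-processing code; `client_ips` is a hypothetical list of the client address of each request in trace order, after known proxy addresses have already been removed:

```python
def estimate_handshakes(client_ips):
    """Given the client IP address of each request in trace order, return
    (full_handshakes, resumed_requests) under the assumption that each
    distinct IP address is one browser: its first request performs a full
    TLS handshake and every later request resumes the cached session."""
    seen = set()
    full = 0
    for ip in client_ips:
        if ip not in seen:   # first request from this address
            seen.add(ip)
            full += 1
    return full, len(client_ips) - full

# e.g. a toy trace with two clients:
trace = ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"]
# estimate_handshakes(trace) -> (2, 3)
```

Applied to the CS trace, this style of counting is what yields the roughly 11,000 sessions over 100,000 requests reported above.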
Label                Description of server configuration
Apache               Apache server
Apache+TLS           Apache server with TLS
Apache+TLS AXL300    Apache server with TLS and AXL300
RSA                  RSA protected key exchange
PKX                  plain key exchange
NULL                 no bulk cipher (plaintext)
RC4                  RC4 bulk cipher
noMAC                no MAC integrity check
MD5                  MD5 MAC integrity check
no cache             no session cache
shmcache             shared-memory based session cache
perfect cache        idealized session cache (always hits)
no randomness        no pseudo-random number generation (also: NULL, noMAC)
plain                no bulk data marshaling (plaintext written directly to the network)
fast resume          simplified TLS session resume (eliminates one round-trip)
Skeleton TLS         all messages of correct size, but zero data

[Bar chart: throughput in hits/sec for each server configuration, with one bar per machine (PIII-500MHz and PIII-933MHz).]

Figure 1. Throughput for Amazon trace and different server configurations, on 500MHz and 933MHz servers.
[Bar chart: throughput in hits/sec for each server configuration (labels as in Figure 1), with one bar per machine (PIII-500MHz and PIII-933MHz).]

Figure 2. Throughput for CS trace and different server configurations, on 500MHz and 933MHz servers.