Slides from the linux.conf.au 2021 conference: https://linux.conf.au/schedule/presentation/64/ .
Tempesta TLS is an implementation of TLS handshakes for the Linux kernel. Since the kernel already provides symmetric ciphers, we focus on asymmetric cryptography only, elliptic curves in particular.
We used the mbed TLS library as the foundation and almost fully rewrote it to make it x40 faster. During development we also used parts of the WolfSSL library. While WolfSSL outperforms OpenSSL, it uses the same algorithms, which are 5-7 years old. Tempesta TLS uses newer and more efficient algorithms from modern cryptography research.
While we are still improving the performance of Tempesta TLS, the implementation already establishes 40-80% more TLS handshakes per second than OpenSSL/Nginx and provides up to x4 lower latency in several tests.
This talk covers the following topics with plenty of benchmarks:
* The fundamentals of elliptic curve computations and the most "hot spots"
* Side channel attacks (SCA) and methods to prevent them
* How the recent CPU vulnerabilities impact TLS handshakes
* Basics of the new fast algorithms used in the Tempesta TLS
* The design trade-offs in OpenSSL, WolfSSL, mbed TLS, and Tempesta TLS
Mathematics and development of fast TLS handshakes
1. Mathematics and development of fast TLS handshakes
Alexander Krizhanovsky
Tempesta Technologies, Inc.
ak@tempesta-tech.com
2. Web content delivery & protection
2013: WAF development by request of Positive Technologies
“Visionary” in the Gartner magic quadrant’15
● Web attacks
● L7 HTTP/HTTPS DDoS attacks
Nginx, HAProxy, etc. - perfect HTTP proxies, not HTTP filters
Netfilter works in the TCP/IP stack (softirq) => HTTP(S)/TCP/IP stack
Tempesta FW:
● hybrid of HTTP accelerator & firewall
● embedded into the Linux TCP/IP stack
4. Linux kernel TLS handshakes
Very fast light-weight Linux kernel implementation
● ...even for session resumption
● there is modern research in the field
Resistant against DDoS on TLS handshakes (asymmetric DDoS)
Privileged address space for sensitive security data
● Varnish: TLS is processed in a separate process, Hitch
http://varnish-cache.org/docs/trunk/phk/ssl.html
● resistance against attacks like CloudBleed
https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/
5. Why NIST p256?
ECDSA
● RSA and the NIST curves p256, p384, and p521 are the only ones allowed for CA certificates
https://cabforum.org/baseline-requirements-documents/
● P256 is the fastest NIST curve
● P521 isn’t recommended by IANA
https://www.iana.org/assignments/tls-parameters/tls-parameters.xml#tls-parameters-8
● RSA is slow and vulnerable to asymmetric DDoS
https://vincent.bernat.im/en/blog/2011-ssl-dos-mitigation
Curve25519
● much faster than NIST p256
● in practice used for ECDHE only
7. The source code
https://github.com/tempesta-tech/tempesta/tree/master/tls
Still in progress: we implement some of the algorithms on our own
Initially a fork of mbed TLS 2.8.0 (https://tls.mbed.org/) - x40 faster!
● very portable and easy to move into the kernel
● cutting-edge security
● too many memory allocations (https://github.com/tempesta-tech/tempesta/issues/614)
● big integer abstractions (https://github.com/tempesta-tech/tempesta/issues/1064)
● inefficient algorithms, no architecture-specific implementations, ...
We also take parts from WolfSSL (https://github.com/wolfSSL/wolfssl/)
● very fast, but not portable
● security: https://github.com/wolfSSL/wolfssl/issues/3184
9. Demo!
Tempesta TLS, Nginx-1.14.2/OpenSSL-1.1.1d, Nginx-1.17.8/WolfSSL
TLS 1.2
● full handshakes
● abbreviated handshakes
tls-perf
https://github.com/tempesta-tech/tls-perf
● establish & drop many TLS connections in parallel
● like THC-SSL-DOS, but faster, more flexible, more options
10. Data for proprietary vendors
BIG-IP is only 30-50% faster than Nginx/OpenSSL/DPDK
https://www.youtube.com/watch?v=Plv87h8GtLc
Avi Vantage (VMware) makes ~2000 handshakes/second per 1 CPU
https://avinetworks.com/docs/latest/ssl-performance/
11. Why faster?
No memory allocations at run time
No context switches
No copies on socket I/O
Fewer message queues
Zero-copy handshake state machine
https://netdevconf.info/0x12/session.html?kernel-tls-handhakes-for-https-ddos-mitigation
State-of-the-art cryptography mathematics
12. Elliptic curve cryptography
Secp256r1: y^2 = x^3 - 3x + b defined over the field GF(p)
p256 = 2^256 - 2^224 + 2^192 + 2^96 - 1
The group law
● negatives: if P = (x, y), then -P = (x, -y)
● addition: R = P + Q
● doubling: R = P + P = 2*P
ECDSA: k - secure random, G - known point
k * G is used for the signature
ECDH: d - private key, Q - public key
shared secret: d * Q
13. Point multiplication
OpenSSL: “Fast prime field elliptic-curve cryptography with 256-bit primes” by Gueron and Krasnov
Q = m * P - the most expensive elliptic curve operation
for i in bits(m):
    Q ← point_double(Q)
    if m_i == 1:
        Q ← point_add(Q, P)
Point multiplications in a TLS handshake:
● known point multiplication: precompute the table for doubled G
● perfect forward secrecy ECDHE: generate keys G * d (d - random)
● handshake: 2 known & 1 unknown point multiplications
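The double-and-add loop above can be sketched with Python big integers and the textbook affine group law on secp256r1. This is only an illustration of the math: it uses no Jacobian coordinates, no precomputed tables, and no constant-time protections, unlike the real implementations discussed in this deck.

```python
# Sketch of left-to-right double-and-add on secp256r1 (illustrative only).
p  = 2**256 - 2**224 + 2**192 + 2**96 - 1       # the P256 prime
a  = -3 % p                                      # curve coefficient a = -3
Gx = 0x6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296
Gy = 0x4FE342E2FE1A7F9B8EE7EB4A7C0F9E162BCE33576B315ECECBB6406837BF51F5
n  = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def point_add(P, Q):
    """Affine group law: addition, doubling and negatives; None = infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                              # P + (-P) = infinity
    if P == Q:                                   # doubling slope (3x^2 + a)/2y
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:                                        # addition slope (y2-y1)/(x2-x1)
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return x3, (s * (x1 - x3) - y1) % p

def point_mul(m, P):
    """Exactly the loop from the slide: double every bit, add on set bits."""
    Q = None
    for bit in bin(m)[2:]:
        Q = point_add(Q, Q)                      # Q <- point_double(Q)
        if bit == '1':
            Q = point_add(Q, P)                  # Q <- point_add(Q, P)
    return Q
```

A quick self-check: multiplying the generator by the group order n must return the point at infinity, and (n-1)*G must equal -G = (Gx, -Gy).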
14. Point representation and coordinate systems
http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian.html
“Analysis and optimization of elliptic-curve single-scalar multiplication”, Bernstein & Lange, 2007
Jacobian coordinates (rough estimations)
● conversion overhead: 39*M + 4*S + 3*I (for window w = 4)
● point addition (mixed) - 8*M + 3*S, doubling - 2*M + 4*S
Affine coordinates (rough estimations)
● point addition - 13*M + 4*S, doubling - 4*M + 5*S
NIST 256 bits, D = 256 / w = 64 Comb rounds (addition & doubling):
64 * (10*M + 7*S) << 64 * (17*M + 9*S)
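The comparison on the last line is easy to check numerically. Counting costs in units of M, and assuming S = 0.8M (the Montgomery squaring estimate from the cost slide), the 64 comb rounds come out roughly as:

```python
# Per-round cost of 64 comb rounds in units of M, assuming S = 0.8*M.
S = 0.8
jacobian = 64 * (10 + 7 * S)   # mixed addition (8M+3S) + doubling (2M+4S)
affine   = 64 * (17 + 9 * S)   # addition (13M+4S) + doubling (4M+5S)
assert jacobian < affine       # ~998M vs ~1549M: Jacobian wins clearly
```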
16. The cost
Modular multiplication (M) is the most expensive basic scalar operation
Modular squaring (S) is faster than M, usually 0.8M (Montgomery)
(0.9M for optimized FIPS due to more expensive modular reduction)
Modular inversion (I) is very expensive, about 100M
17. Modular arithmetic
ex. prime field F(29)
addition: 17 + 20 = 8 since 37 mod 29 = 8
subtraction: 17 − 20 = 26 since −3 mod 29 = 26
multiplication: 17 * 20 = 21 since 340 mod 29 = 21
inversion: 17^-1 = 12 since 17 · 12 mod 29 = 1
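The F(29) examples above can be replayed directly with Python integers; `pow(x, -1, p)` (Python 3.8+) computes the modular inverse:

```python
# Replaying the F(29) examples with Python integers.
p = 29
assert (17 + 20) % p == 8      # 37 mod 29 = 8
assert (17 - 20) % p == 26     # -3 mod 29 = 26
assert (17 * 20) % p == 21     # 340 mod 29 = 21
inv = pow(17, -1, p)           # modular inverse of 17 in F(29)
assert inv == 12 and (17 * inv) % p == 1
```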
Montgomery reduction (the most used)
● there is some overhead, but each modular operation is cheaper
FIPS reduction
● can be faster if a small number of modular operations is used
● there are optimization techniques, e.g. “Low-Latency Elliptic Curve Scalar Multiplication”, Bos, 2012
● but still about 65% slower than Montgomery reduction
18. Montgomery multiplication in P256
“Montgomery Multiplication”, Henry S. Warren, Jr.
Fast 256-bit integer multiplication with modular reduction on P256
a, b < m (m - the P256 modulus)
Set n = 2^256
Transform the multipliers into the Montgomery domain (overhead):
a’ = a*n mod m, b’ = b*n mod m
Fast multiplication with reduction: u = a’b’/n mod m
● compute only 256 bits of (a’b’ + ((-m^-1 * a’b’) mod n) * m) / n
● if u > m, then u ← u - m (unconditionally, carry used as a mask)
Convert back to an ordinary number: v = u * n^-1 mod m
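The scheme above can be sketched in Python over the full 2^256 radix. This shows only the arithmetic: real implementations work word by word (e.g. 64 bits at a time) and in constant time, neither of which is attempted here.

```python
# Sketch of Montgomery multiplication with the slide's notation:
# radix n = 2^256, modulus m = the P256 prime.
m = 2**256 - 2**224 + 2**192 + 2**96 - 1
n = 2**256
m_inv = pow(-m, -1, n)                    # -m^-1 mod n, precomputed once

def mont_mul(a_, b_):
    """Compute a'*b'/n mod m without dividing by m (REDC)."""
    t = a_ * b_
    u = (t + ((t * m_inv) % n) * m) // n  # numerator is divisible by n
    return u - m if u >= m else u         # single conditional subtraction

# Round trip: into the Montgomery domain, multiply, back out again.
a, b = 0x1234567890ABCDEF, 0xFEDCBA0987654321
a_m = (a * n) % m                         # a' = a*n mod m (domain entry)
b_m = (b * n) % m
prod_m = mont_mul(a_m, b_m)               # a*b*n mod m, still in the domain
prod = mont_mul(prod_m, 1)                # multiplying by 1 strips the n factor
assert prod == (a * b) % m
```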
19. The math layers
Different multiplication algorithms for fixed and unknown points
● “Efficient Fixed Base Exponentiation and Scalar Multiplication based on a Multiplicative Splitting Exponent Recoding”, Robert et al, 2019
Point doubling and addition - everyone seems to use the same algorithms
Jacobian coordinates: different modular inversion algorithms
● “Fast constant-time gcd computation and modular inversion”, Bernstein et al, 2019
Modular reduction for scalar multiplications:
● Montgomery has overhead vs FIPS speed: if we use fewer multiplications, it could make sense to use the FIPS reduction method instead (seems a dead end)
● “Low-Latency Elliptic Curve Scalar Multiplication”, Bos, 2012
=> Balance between all the layers
20. Example of math layers balancing
For w=5 we need 52 point additions for an unknown point multiplication
Jacobian coordinates addition takes 11M + 5S
Affine-Jacobian coordinates addition takes 8M + 3S
● about 4.4M cheaper if S = 0.8M
● requires 2 coordinate normalizations for the comb precomputation
● coordinate normalization: (2^w - 1) * (6M + 1S) + 1I
Almost the same for S = 0.9M and fast inversion I < 100M
Montgomery arithmetic (S = 0.8M):
● ECDHE +28% and ECDSA +6% performance
21. Side channel attacks (SCA) resistance
Timing attacks, simple power analysis, differential power analysis, etc.
Protections against SCAs:
● constant-time algorithms
● dummy operations
● point randomization
● e.g. modular inversion: 741 vs 266 iterations
RDRAND allows writing faster non-constant-time algorithms
● SRBDS mitigation costs about 97% of performance
https://software.intel.com/security-software-guidance/insights/processors-affected-special-register-buffer-data-sampling
https://www.phoronix.com/scan.php?page=news_item&px=RdRand-3-Percent
22. Memory usage & SCA
ex. ECDSA precomputed table for fixed point multiplication
● mbed TLS: ~8KB dynamically precomputed table, point randomization, constant-time algorithm, full table scan
● OpenSSL: ~150KB static table, full scan
● WolfSSL: ~150KB, direct index access (fixed in the new version)
https://github.com/wolfSSL/wolfssl/issues/3184
=> 150KB is far larger than the L1d cache size, so many cache misses
23. Big Integers (aka MPIs)
“BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic”, by Tom St Denis
All the libraries use them (not in hot paths); mbed TLS overuses them
linux/lib/mpi/, linux/include/linux/mpi.h

typedef unsigned long int mpi_limb_t;

struct gcry_mpi {
    int alloced;        /* array size (# of allocated limbs) */
    int nlimbs;         /* number of valid limbs */
    int nbits;          /* the real number of valid bits (info only) */
    int sign;           /* indicates a negative number */
    unsigned flags;
    mpi_limb_t *d;      /* array with the limbs */
};

Need to manage variable-size integers => size-specific assembly implementations
24. Easy assembly
// a := a + b
// x[0] is the less significant limb,
// x[1] is the most significant limb.
void s_mp_add(unsigned long *a, unsigned long *b) {
unsigned long carry;
a[0] += b[0];
carry = (a[0] < b[0]);
a[1] += b[1] + carry;
}
// Pointer to a is in %RDI, pointer to b is in %RSI
movq (%rdi), %r8
movq 8(%rdi), %r9
addq (%rsi), %r8 // add with carry
addc 8(%rsi), %r9 // use the carry in the next addition
movq (%r8), (%rdi)
movq (%r9), 8(%rdi)
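A quick way to see where `carry = (a[0] < b[0])` comes from is to model the two-limb add with Python integers masked to 64 bits: unsigned addition wraps around, and a wrapped result is necessarily smaller than either addend.

```python
# Model of the two-limb add above, with Python ints masked to 64 bits.
MASK = (1 << 64) - 1

def s_mp_add(a, b):
    """a, b: two-element lists of 64-bit limbs, least significant first."""
    a[0] = (a[0] + b[0]) & MASK
    carry = a[0] < b[0]        # wrapped => result is smaller than an addend
    a[1] = (a[1] + b[1] + carry) & MASK

a = [MASK, 0]                  # the 128-bit value 2^64 - 1
b = [1, 0]
s_mp_add(a, b)
assert a == [0, 1]             # carry propagated into the high limb
```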
25. Open questions and further research
Ice Lake CPUs have negligible downclocking on AVX-512
https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
Parallel Montgomery computations
J.W. Bos, “Montgomery Arithmetic from a Software Perspective”, 2017
● SIMD multiplications & squarings of two and more products
● interleaved Montgomery multiplications
Better methods for point multiplications
26. Going to the Linux kernel upstream
Details and the discussion:
● https://netdevconf.info/0x14/session.html?talk-performance-study-of-kernel-TLS-handshakes
● https://github.com/tempesta-tech/tempesta/issues/1433
Server-side only
The full TLS handshake is in softirq (just like TCP)
Fallback to a user-space TLS library on ClientHello
Batches of handshakes in one FPU context
27. TODO
More cryptography mathematics performance optimizations
https://github.com/tempesta-tech/tempesta/issues/1064
https://github.com/tempesta-tech/tempesta/issues/1335
TLS 1.3
https://github.com/tempesta-tech/tempesta/issues/1031
Moving to the kernel asymmetric keys API
https://github.com/tempesta-tech/tempesta/issues/1332
The Linux kernel /crypto API performance issues
● SHA-256 (crucial for the TLS handshake) is 30-100% slower than OpenSSL
https://github.com/tempesta-tech/tempesta/issues/1483
● extra copying and memory allocations in kTLS
https://github.com/tempesta-tech/tempesta/issues/1064
28. Netdev papers about Tempesta TLS
“Kernel HTTP/TCP/IP stack for HTTP DDoS mitigation”, Netdev 2.1,
https://netdevconf.info/2.1/session.html?krizhanovsky
“Kernel TLS handshakes for HTTPS DDoS mitigation”, Netdev 0x12,
https://netdevconf.info/0x12/session.html?kernel-tls-handhakes-for-https-ddos-mitigation
“Performance study of kernel TLS handshakes”, Netdev 0x14,
https://netdevconf.info/0x14/session.html?talk-performance-study-of-kernel-TLS-handshakes
29. Thanks!
Contact us if you’re interested in fast Linux kernel TLS!
https://github.com/tempesta-tech/tempesta
ak@tempesta-tech.com