2014 
1. “Beyond Modes: Building a Secure Record Protocol from a Cryptographic Sponge 
Permutation.” CT-RSA 2014. LNCS 8366, pp. 270–285, Springer (2014) 
2. “CBEAM: Efficient Authenticated Encryption from Feebly One-Way ϕ Functions.” 
CT-RSA 2014. LNCS 8366, pp. 251–269, Springer (2014) 
3. “STRIBOB: Authenticated Encryption from GOST R 34.11-2012 LPS 
Permutation.” CTCrypt ’14. To appear in Mathematical Aspects of Cryptography, 
Steklov Mathematical Institute of RAS (2014) 
4. “Simple AEAD Hardware Interface (SÆHI) in a SoC: Implementing an On-Chip 
Keyak/WhirlBob Coprocessor.” TrustED 2014, ACM CCS 2014 Workshops, 03 
November 2014, Scottsdale AZ US. To appear. ACM (2014) 
5. “Lighter, Faster, and Constant-Time: WHIRLBOB, the Whirlpool variant of 
STRIBOB.” With Billy Bob Brumley. IACR ePrint 2014/501. Submitted (2014) 
6. “BRUTUS: Identifying Cryptanalytic Weaknesses in CAESAR First Round 
Candidates.” IACR ePrint 2014/850. Submitted (2014) 
+ Invited Talks. 1/17
Simple AEAD Hardware Interface (SÆHI) in a SoC: 
Implementing an On-Chip Keyak/WhirlBob Coprocessor 
Dr. Markku-Juhani O. Saarinen 
mjos@item.ntnu.no 
NORWEGIAN UNIVERSITY OF SCIENCE AND TECHNOLOGY 
TrustED ’14 – 03 November 2014 – Scottsdale AZ 
2/17
Authenticated Encryption with Associated Data 
An Authenticated Encryption with Associated Data (AEAD) primitive provides: 
▶ Encryption or confidentiality / privacy protection, and 
▶ Authentication or integrity protection for encrypted and associated data. 
Preferably in a single pass over the data. 
Security protocols such as IPSec and SSL/TLS usually required two processing steps 
for each packet in the 1990s and 2000s. 
▶ Authentication was handled with an HMAC (Hash-based Message Authentication Code). 
▶ Encryption was provided either with a block cipher such as 3DES-CBC or 
AES-CBC or with a stream cipher such as RC4. 
Hardware implementation of such a twin set-up is cumbersome. 
The transition to AE has been swift in recent years because of AES-GCM’s status in 
Suite B (Classified COTS) and attacks such as CRIME, LUCKY13, and POODLE. 
3/17
Background: CAESAR project for new AEAD algorithms 2014-2017 
NIST-sponsored international Competition for Authenticated 
Encryption: Security, Applicability, and Robustness. 
http://competitions.cr.yp.to/caesar-call.html 
▶ Jan 2013 Announced by Dan Bernstein (secretary) 
▶ Mar 2014 Deadline for first-round submissions (57) 
▶ May 2014 Deadline for first-round software 
▶ Aug 2014 DIAC ’14 Workshop, UCSB 
▶ Jan 2015 Second round candidates announced 
▶ Feb 2015 Second round tweaks (fixes) 
▶ Feb 2015 Second round Verilog / VHDL (this talk) 
▶ Dec 2015 Third round candidates 
▶ Dec 2016 Final round candidates 
▶ Dec 2017 Final CAESAR portfolio announcement 
4/17
Hardware API for Authenticated Encryption 
CAESAR candidates came in many shapes and sizes. Here’s a rough breakdown: 
8 are clearly based on a SHA3-style Sponge construction. 
9 are (somehow) constructed from AES components. 
19 are AES modes of operation. 
21 are based on other design paradigms or are entirely ad hoc. 
We want consistent testing across second round candidates. 
Signalling. How to communicate with the hardware? Can a consistent, high-level 
“hardware API” be constructed? 
Memory access. Some prominent proposals (AEZ and SIV) require two passes over the 
data, so APIs in the style of hash functions don’t really work. 
What to test. Realistic test profiles via operating system and application integration. 
5/17
System-on-Chip (SoC) Designs 
[Chart: Total global shipments 2014 (million units). Categories: Android, Other mobile, PC total. Values shown: 1241.664, 314.065, 853.829.] 
The majority of Internet and communication devices are Android/Linux-based tablets 
or smartphones. 
System-on-Chip (SoC) designs integrate all the necessary 
components of a computing application on a single chip. 
Mobile electronics such as (smart) phones and tablets are 
built on SoCs. SoCs are also found in Internet-of-Things (IoT) 
appliances, modems, routers, home media, cars, etc. 
Security of transmitted and stored data is even more relevant 
to mobile devices than to traditional PC systems. 
Limited CPU performance. 
Energy efficiency critical. 
Coprocessors: Audio and video codecs, RF processing, 3D 
display rendering, M7/CCP motion, natural language, etc. 
→ Our evaluation target. 
6/17
Zynq-7000 FPGA Artix 7 / ARM Cortex A9 SoC 
[Block diagram: Zynq-7000. The Processing System contains two Cortex-A9 MPCore cores with NEON DSP/FPU engines and 32/32 KB I/D caches, a snoop control unit, 512 Kbyte L2 cache, 256 Kbyte on-chip memory, flash and multiport DRAM controllers, and I/O peripherals (2x SPI, I2C, CAN, UART, GPIO, plus SDIO, USB, and GigE with DMA). It connects through the AMBA interconnect and AXI ports to the Programmable Logic (system gates, DSP, RAM), which also hosts XADC, PCIe Gen2, and security (AES, SHA, RSA) blocks.] 
On a single chip: 
▶ Dual-core ARM Cortex A9 CPU @ 650 MHz. 
▶ Artix-7 or Kintex-7 - type FPGA logic fabric. 
▶ Can run Linux and Android. 
▶ Realistic target for SoC implementations. 
▶ Full devkit under $200. 
We Study: 
▶ Hardware assisted implementations vs. 
software vs. hardware implementations. 
▶ FPGA and software footprint, speed, power. 
▶ Integration in applications e.g. via OpenSSL. 
7/17
Implementation 1: Keyak (SHA3 Keccak) Core 
[Figure: RTL schematic of the k1600 module (ports clk, in[1599:0], rnd[4:0], out[1599:0]), composed of 64-bit RTL_XOR and RTL_AND cells and the keccak_rc round-constant ROM.] 
Single round of Keccak/Keyak 1600-bit 
core permutation drawn with 64-bit 
data paths. Mainly XORs visible. 
SHA3 algorithm Keccak and Keyak AEADs: 
▶ Keccak is a sponge hash with a 1600-bit core 
permutation, selected as SHA3 in 2012. 
▶ Designed by Guido Bertoni, Joan Daemen, 
Michaël Peeters, and Gilles Van Assche. 
▶ Same team proposed the Keyak family of 
AEADs that utilize the same permutation in the 
CAESAR project. 
We implemented: 
▶ The 1600-bit core in Verilog for the Artix-7 
FPGA core of Zynq 7000. 
▶ The module can be utilized for both hashing and 
authenticated encryption. 
8/17
Implementation 2: WhirlBob / Whirlpool Core 
WhirlBob Core. 
StriBob, WhirlBob, and Whirlpool: 
▶ StriBob is a CAESAR proposal by Markku-Juhani O. Saarinen. 
▶ WhirlBob is a 2nd round tweak proposed by M.-J.O. Saarinen and 
Billy Bob Brumley (submitted to INSCRYPT ’14). 
▶ WhirlBob is based on the permutation of the (ISO) Standardized 
Whirlpool 3.0 hash by Paulo Barreto and Vincent Rijmen. 
We implemented: 
▶ The 512-bit, 1 cycle per round core permutation in Verilog for the 
Artix-7 FPGA core of Zynq 7000. 
▶ The module can be utilized for both Whirlpool hashing and 
WhirlBob authenticated encryption. 
9/17
Hardware Performance 
With one “extra” reloading cycle per block, the maximum theoretical throughput of 
these implementations is: 
Parameter        WhirlBob  Keyak 
Rounds           12        12 
Cycles           13        13 
Rate (bits)      256       1344 
Speed (bit/clk)  19.7      103.4 
Processing speeds are significantly slower when the Keccak core is used in the 
24-round SHA3 hashing mode. Speed ranges from 23.0 (SHA3-512) to 47.5 
(SHA3-224) bits/clock. Whirlpool, in comparison, is slightly faster than WhirlBob. 
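The AEAD figures in the table follow from dividing the rate by the number of cycles per block. A minimal C sketch of that arithmetic, assuming one clock cycle per round plus the reloading cycles; the function name bits_per_clock is illustrative and does not come from the implementations:

```c
#include <assert.h>

/* Ideal throughput in bits per clock cycle.
 * One block of `rate_bits` is processed every (rounds + reload_cycles)
 * cycles, assuming one clock cycle per round.
 * Illustrative helper only; not part of the original code base. */
double bits_per_clock(unsigned rate_bits, unsigned rounds, unsigned reload_cycles)
{
    unsigned cycles = rounds + reload_cycles;
    assert(cycles > 0);
    return (double)rate_bits / (double)cycles;
}
```

With a 256-bit rate and 12 rounds this gives 256/13 ≈ 19.7 bits/clock, and with a 1344-bit rate it gives 1344/13 ≈ 103.4. The same formula with the 576-bit rate of SHA3-512 and 24 rounds gives 576/25 ≈ 23.0.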
10/17
CAESAR Software API vs. Hardware API 
A simple C API was specified by the CAESAR secretariat for reference software 
implementations of the first round candidates. 
int crypto_aead_encrypt(
    uint8_t *c, uint64_t *clen,          // Ciphertext (output) and its length
    const uint8_t *m, uint64_t mlen,     // Message
    const uint8_t *ad, uint64_t adlen,   // Associated Data
    const uint8_t *nsec,                 // (Secret IV)
    const uint8_t *npub,                 // Nonce
    const uint8_t *k);                   // Secret Key
Decryption and integrity verification can be performed with crypto_aead_decrypt(), 
which has an equivalent interface. 
SÆHI utilizes the same software API and a simple memory-mapped hardware API. The 
software side is essentially a driver suitable for bare metal implementation. 
11/17
Proposed Baseline Hardware API 
Our cryptographic coprocessor has a simple, almost universal memory-mapped 
interface. The module (hardware pin) interface is the same as for a generic single-port 
RAM (with an optional interrupt request line). 
Signal  Dir  Purpose 
ADDR    In   Address 
DI      In   Data Write 
WE      In   Write enable 
EN      In   Enable/Select 
CLK     In   Clock 
DO      Out  Data Read 
IRQ     Out  Interrupt Req. 
[Diagram: AEAD Core block with inputs ADDR, DI, WE, EN, CLK and outputs DO, IRQ.] 
The signaling between software component and this API is defined by the driver. 
Faster (DMA, AXI) alternatives can be used – this is just the baseline interface. 
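Because the coprocessor looks like a single-port RAM, the software side reduces to ordinary loads and stores on a mapped address range. The following C sketch illustrates the idea only. The register layout (control word, status word, data window) and all saehi_* names are assumptions made for this example; the actual signalling is defined by the driver and is not given in these slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word offsets inside the mapped window (assumed layout). */
enum {
    SAEHI_REG_CTRL   = 0,  /* write 1 to start an operation (assumed) */
    SAEHI_REG_STATUS = 1,  /* reads nonzero when finished (assumed)   */
    SAEHI_REG_DATA   = 2   /* first word of the data window (assumed) */
};

/* Copy input words into the data window. On real hardware `base`
 * would point at the memory-mapped coprocessor address range. */
void saehi_write_block(volatile uint32_t *base, const uint32_t *src, size_t nwords)
{
    assert(base != NULL && src != NULL);
    for (size_t i = 0; i < nwords; i++)
        base[SAEHI_REG_DATA + i] = src[i];
}

/* Copy result words back out of the data window. */
void saehi_read_block(volatile uint32_t *base, uint32_t *dst, size_t nwords)
{
    assert(base != NULL && dst != NULL);
    for (size_t i = 0; i < nwords; i++)
        dst[i] = base[SAEHI_REG_DATA + i];
}

/* Kick off processing by writing the control word. */
void saehi_start(volatile uint32_t *base)
{
    assert(base != NULL);
    base[SAEHI_REG_CTRL] = 1;
}

/* Poll the status word; a bare-metal driver could spin on this
 * or wait for the optional IRQ line instead. */
int saehi_done(volatile uint32_t *base)
{
    assert(base != NULL);
    return base[SAEHI_REG_STATUS] != 0;
}
```

The volatile qualifier keeps the compiler from caching or reordering the accesses, which matters once the pointer refers to hardware rather than RAM. For testing without hardware, an ordinary array can stand in for the mapped window.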
12/17
Comparing Implementations 
Code lines in our WhirlBob (StriBob) and 
Keyak reference implementations: 
Component          WhirlBob  Keyak 
Interface Verilog  99        114 
Round Verilog      228       129 
Driver C           60        60 
API Interface C    261       250 
Total code         639       553 
Post synthesis and route utilization within 
Artix-7 FPGA fabric of Xilinx Zynq 7010: 
Logic        WhirlBob  Keyak 
LUTs         3,795     4,574 
Flip-Flops   1,060     3,237 
MUXs         90        159 
Other        1         2 
Total logic  4,946     7,972 
13/17
Implementations 
We first developed the implementations with a homemade VGA module (not utilizing 
CPU at all). The implementations were then integrated into Xillinux and made 
accessible to user space daemons. 
14/17
What to test? 
We hope to measure for each candidate: 
A Area. FPGA Slices or ASIC Gate Equivalents. 
W Power. Power consumption (Watts = J/s). 
R Speed. Ideal throughput (Bytes/Second). 
One key goal is to maximize 
e = R / W (Bytes/Joule). 
Note that doubling A for factor-2 parallelism will approximately double both R and W, 
so e will remain constant. 
The same is true for doubling the clock frequency since power consumption is almost 
linearly dependent on clock frequency for most (CMOS) circuits. 
Hence Bytes/Joule is perhaps the most relevant metric for mobile devices. 
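The scaling argument can be checked numerically. This C sketch reads the metric as throughput divided by power, which matches the Bytes/Joule unit. The function name bytes_per_joule is illustrative, and the figures in the usage note below are made-up values, not measurements.

```c
#include <assert.h>

/* Energy efficiency in bytes per joule:
 * throughput R (bytes/second) divided by power W (watts = joules/second).
 * Illustrative helper; not part of the original implementations. */
double bytes_per_joule(double bytes_per_second, double watts)
{
    assert(watts > 0.0);
    return bytes_per_second / watts;
}
```

For a hypothetical core running at 100e6 bytes/s and 0.5 W, bytes_per_joule gives 2e8 bytes/J. Doubling both throughput and power (200e6 bytes/s at 1.0 W), as parallelism or a doubled clock would approximately do, gives the same 2e8 bytes/J.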
15/17
Integration path for Linux/Android Testing 
[Diagram: System-on-Chip software stack. Applications (browser, SSH, command-line utilities) call the TLS API (libssl.so) and SSH API (libssh.so), which use the OpenSSL Crypto API (libcrypto.so) with an AEAD plugin engine (libmyaead.so). The engine reaches the user-space cipher daemons (SHI AEAD 1, SHI AEAD 2) through interprocess communication; the kernel and CPU core sit below the user-space processes.] 
▶ The dominant underlying API for Linux is based on 
OpenSSL: libcrypto, libssl. Supported by browsers etc. 
▶ OpenSSL supports configurable plugin “engines”. 
▶ After recent bugs (Heartbleed), new forks have appeared: 
   ▶ Google: BoringSSL. 
   ▶ OpenBSD group: LibreSSL (upcoming ressl API). 
▶ Since the hardware accelerator is a shared resource, 
implement it as a user-space daemon. 
▶ Utilize experimental ciphersuite identifiers in 
applications and TLS, SSH, IPSec. Plug-in CAESAR 
ciphers to replace AES-GCM. 
▶ Measure utilization, power, time, throughput with 
realistic usage profiles. 
16/17
Conclusions 
▶ CAESAR is a project to find next-generation Authenticated Encryption algorithms. 
▶ Proposed SÆHI, a simple memory-mapped hardware API for CAESAR ciphers. 
▶ Realistic hardware target: System-on-Chip with FPGA logic and ARM Cortex A9. 
▶ FPGA implementations of Keyak and WhirlBob algorithms. 
▶ Integration path for Applications in Android. 
Next: a little demo! 
17/17
