High Speed and Throughput Evaluation of SHA-1 Hash Function Design with Pipelining and Unfolding Transformation Techniques

Shamsiah binti Suhaili¹, Takahiro Watanabe² and Norhuzaimin Julai¹
¹Faculty of Engineering, Universiti Malaysia Sarawak, 94300 Kota Samarahan.
²Graduate School of Information, Production and System, Waseda University, Kitakyushu-shi, Fukuoka, 808-0135 Japan.
sushamsiah@unimas.my

Abstract—In recent years, designing of SHA-1 hash function has become popular because it was important in security design application. One of the applications of SHA-1 hash function was HMAC where the architecture of SHA-1 needed to be improved in terms of speed and throughput in order to obtain the high-performance design. The objective of this project was to design high speed and throughput evaluation of SHA-1 hash function based on a combination of pipelining and unfolding techniques. By using both techniques in designing the architecture of SHA-1 design, the speed of SHA-1 hash function can be increased significantly as well as throughput of the design. In this paper, five proposed SHA-1 architectures were designed with different stages of pipelining such as 1, 4 and 40 stages. The results showed the high-speed design of SHA-1 design can be obtained by using 40 stages pipelining with unfolding factor two. This design provided a high-speed implementation with maximum frequency of 308.17 MHz on Arria II GX and 458.59 MHz on Virtex 5 XC5VLX50T. Furthermore, the throughput of the design also increased about 150.269 Gbps and 223.618 Gbps on Arria II GX and Virtex 5 XC5VLX50T respectively. Thus, high-speed design of SHA-1 hash function was successfully obtained which can give benefit to society especially in security system data transmission and other types of hash functions.

Index Terms—FPGA; Pipelining; SHA-1 Hash Function; Unfolding Transformation.

I. INTRODUCTION

Cryptography is the the science of writing messages in secret codes; no one can read an encrypted message except the intended receiver. There are three types of cryptographic algorithms: symmetric cryptographic (secret key), asymmetric cryptographic (public key), and hash function. Cryptographic hash function or known as message digest algorithm is an algorithm that translates a random string of characters into hash code output [1]. Hash function is widely used for cryptographic application and it is useful for message authentication because it is strong enough against collision. There are different types of hash function such as SHA-1, MD5, RIPEMD160, and others. This paper focuses on cryptographic SHA-1 hash function because it is simple and is widely used in digital signature applications, Keyed-hash Message Authentication Codes (HMACs) and Hash-based Key Derivation Functions (Hash-based KDFs) [2]. Efficient implementation of cryptographic hash function on reconfigurable hardware is one of the problems that need to be solved. Thus, high performance of hash function design in terms of speed, and area is important to improve the throughput of the hash function design since nowadays all the security system need fast implementation. Maximum frequency needs to be considered in designing hash function in order to increase the speed of the design. Besides, efficient hash function also has small area implementation. Thus, in order to improve the speed and area, this paper proposes the architecture of SHA-1 40 stages pipelining with unfolding factor 2 transformations. Besides, this method can also be implemented to other hash functions to improve the speed of the design.

There are many researchers have been designed SHA-1 algorithm based on FPGA. Some of them tried to increase the frequency and throughput of the SHA-1 design [3-14]. J. Kim [3] proposed unfolding transformation with 4 stages pipelining whose throughput was 7.4 Mbps, E. H. Lee [10] and H. Michail [12] proposed SHA-1 design whose throughputs were 5.9 Gbps and 4.7 Gbps respectively with 4 stages pipelining. L. Jiang [5] improved the throughput by applying the unfolding transformation with 40 stages pipelining. This structure gave 76.2 Gbps of throughput but large area implementation which was about 32048 total registers and 33764 ALUTs. Thus, this paper proposed a combination of pipelining and unfolding transformation to improve the performance of SHA-1 hash function.

II. FUNDAMENTALS OF SHA-1 ALGORITHM

SHA-1 hash function algorithm takes arbitrary input message with the length of the message less than $2^{64}$ bits. The output of hash function is called message digest or hash code. SHA-1 hash function produces 160 bits output of hash code with the processing of 512 bits message inputs and 160 bits initial value. There are several steps need to be considered in order to produce 160-bit SHA-1 hash output. First, the message needs to be padded, so its length is congruent to 448 modulo 512. After that, the input message will be followed by single 1-bit input at the end of the message, and then 0 bits are padded until the last 64-bit which is the length of the message. After padding the message, the inputs of 64 bits are appended to the message. Therefore, overall message padding is 512 bits together with the length of the message. Table 1 shows buffer initialization of SHA-1 hash function in hexadecimal. There are five different inputs buffer initialization such as A, B, C, D, and E. These inputs are fixed for any kind of SHA-1 hash function.

After initialization process, the next step is to process the 512-bit input message which is in 16 blocks with 32-bit per block. Figure 1 illustrates 80 steps of message processing
with 20 steps for four different non-linear function of SHA-1 hash function. Each step function has the same structure but different non-linear function, $f$, constant value, $K$ and message, $W$. Table 2 and Table 3 show the non-linear function and constant value, $K$ of SHA-1 hash function respectively.

$ROTL^n(b)$ denotes rotation to the left of $b$ by a position. Equation (2) provides the output for five different registers $A$, $B$, $C$, $D$, and $E$. Figure 2 illustrates the compression function of SHA-1 algorithm. The output $A$ is the crucial part where it consists of non-linear function, input message, $W$, constant, $K$, input $E$ and shift $A$ to the left by 5. Finally, buffer initialization input will be added to the last output rounds to produce message digest of SHA-1 hash function.

$$W_i = ROTL_4(W_{i-3} \oplus W_{i-4} \oplus W_{i-14} \oplus W_{i-16})$$  \hspace{1cm} (1)

$$T = (A \ll 5) + f(B,C,D) + W_i + K_i + E$$

$$A = T, B = A, C = B \ll 30, D = C, E = D$$ \hspace{1cm} (2)

**III. PROPOSED SHA-1 DESIGN**

In this project, iterative and pipelining techniques are applied to SHA-1 hash function algorithm in order to increase the speed of the design [15]. Several methods have been proposed such as iterative, inner-round pipelining and outer-round pipelining. In order to improve the throughput of SHA-1 design, unfolding transformation technique was proposed [16]. Unfolding algorithm is one of the techniques that can be used by DSP application to obtain a new program that performs more than one iteration of the original program. The number of iterations is introduced as unfolding factor, $J$. By applying these techniques to SHA-1 architecture, the high performance of SHA-1 hash function can be obtained.

There are five proposed SHA-1 to be designed and evaluated as follows: (1) iterative SHA-1, (2) inner-round pipelining SHA-1, (3) 4 stages pipelining SHA-1, (4) 40 stages pipelining SHA-1 and (5) 40 stages pipelining with unfolding transformation factor 2 SHA-1. These architectures are designed using Verilog code and synthesized with both Altera and Xilinx tools. Figure 3 shows the architecture of 4 stages pipelining with five different inputs. This figure can be used to construct the architecture of 40 stages pipelining with unfolding factor two where each stage consists of two rounds. Some modification has been made in terms of input and output of the modules to obtain the output of specific proposed design. This design starts with buffer initialization which is init module. In this module, there are five different
fixed values of buffer initialization values as mentioned in Table 1. In order to execute schedule module, 80 rounds of counter module need to be designed. Schedule module is a module that executes message input to step function. The message input for the first 15 data input of SHA-1 comes from 512-bit message input which has been divided into 15 blocks of 32-bit message input. In this schedule module, the message after 15 rounds will be calculated using the formula given in Equation (1).

\[
\begin{align*}
\text{Temp} & = \text{ROTL}^5(A_t) + \text{func}(B_t, C_t, D_t) + E_t + W_t + K_t \\
A_{t+2} & = \text{ROTL}^5(\text{Temp}) + \\
f_{t+1} & = (A_t, \text{ROTL}^{30}(B_t), C_t) + D_t + W_{t+1} + K_{t+1} \\
B_{t+2} & = \text{Temp} \\
C_{t+2} & = \text{ROTL}^{30}(A_t) \\
D_{t+2} & = \text{ROTL}^{30}(B_t) \\
E_{t+2} & = C_t
\end{align*}
\]  

\[ (3) \]

A. Compression Function of Unfolding SHA-1 Algorithm

The main objective of this project is to obtain high-speed design. Therefore, four and forty stages of pipelining SHA-1 design are proposed. By adding registers at different stages of round can improve the maximum frequency of the design. Four registers are applied to A, B, C, D, and E of compression function with 20 rounds for each stage and forty registers are applied to A, B, C, D, and E of compression function with two rounds for each stage. In order to improve the throughput of SHA-1 design, unfolding transformation technique has been proposed. In this paper, the unfolding technique is designed based on factor 2. Figure 4 illustrates the structure of compression function of SHA-1 algorithm using unfolding transformation technique. From this figure, there are two non-linear functions such as Func and Func_c, with three different inputs. Furthermore, 8 addition operations are performed in parallel form in this architecture. It consists of four shifters such as circular left shift of A by 30, circular left shift of B by 30, circular left shift of A by 5 and circular left shift of Temp by 5. This process reduces the number of processing cycle of SHA-1 design from 80 cycles to 40 cycles where two hash operations are executed per cycle. Small number of latency can increase the throughput of SHA-1 design. Hence, efficient and high-performance design can be obtained. The outputs of SHA-1 unfolding algorithms are shown in the following equation. \( \text{ROTL}^a(b) \) represents circular left shift or left rotation operation of \( b \) by \( a \) position to the left, and \( \text{func}(p, q, r) \) means non-linear function at time \( t \) for three different input \( p, q \) and \( r \).

The five SHA-1 designs (1) iterative, (2) inner pipelining, (3) 4 stages pipelining, (4) 40 stages pipelining, and (5) 40 stages pipelining with unfolding factor 2 transformation were successfully designed using Verilog code. These designs were synthesized and implemented using both Altera Quartus II 15.0 and Xilinx ISE design suite 14.7. The proposed SHA-1 designs were simulated using ModelSim for both functional and timing simulation to verify the correctness of the design. TimeQuest timing analyzer was used to optimize the design by giving appropriate clock constraint of sdc (synopsys design constraint) file in order to meet the timing requirement. Table 4 and Table 5 show the five proposed SHA-1 design results obtained on Arria II GX and Virtex 5 XC5VLX50T after synthesis and implementation process. The following formula can be used to calculate the throughput of SHA-1 design. \( p \) denotes the number of pipeline stages in the SHA-1 design.

\[
\text{Throughput} = \frac{512 \times \text{Maximum Frequency} \times p}{\text{Latency}}
\]  

\[ (4) \]

The novelty of this design is using combination method between unfolding and the number of pipelining stages as shown in Table 4 and 5. Table 4 shows the result for...
unfolding SHA-1 40 stages pipelining give high throughput which is about 150.27 Gbps on Arria II GX. The maximum frequency of this design improves significantly on Xilinx Virtex 5 XC5VLX50T with 458.60 MHz as shown in Table 5. Since the latency for this unfolding SHA-1 design reduces from 80 cycles to 40 cycles, the throughput of this design increases up to 223.62 Gbps. Besides, the total register of the SHA-1 unfolding transformation with 40 stages pipelining reduces for both Altera and Xilinx implementation.

**Table 4**

<table>
<thead>
<tr>
<th>Design</th>
<th>Pipeline</th>
<th>Maximum Frequency (MHz)</th>
<th>ALUTs</th>
<th>Total Registers</th>
<th>Latency</th>
<th>Throughput (Mbps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SHA1_tte</td>
<td>1</td>
<td>279.64</td>
<td>402</td>
<td>720</td>
<td>82</td>
<td>1746.04</td>
</tr>
<tr>
<td>SHA1_inner</td>
<td>1</td>
<td>281.53</td>
<td>411</td>
<td>681</td>
<td>82</td>
<td>1757.85</td>
</tr>
<tr>
<td>SHA1_4</td>
<td>4</td>
<td>316.76</td>
<td>976</td>
<td>715</td>
<td>82</td>
<td>7911.27</td>
</tr>
<tr>
<td>SHA1_40</td>
<td>40</td>
<td>321.54</td>
<td>6715</td>
<td>3229</td>
<td>82</td>
<td>80306.57</td>
</tr>
<tr>
<td>SHA1_Unfold40</td>
<td>40</td>
<td>308.17</td>
<td>9799</td>
<td>718</td>
<td>42</td>
<td>150269.56</td>
</tr>
</tbody>
</table>

**Table 5**

<table>
<thead>
<tr>
<th>Design</th>
<th>Pipeline</th>
<th>Maximum Frequency (MHz)</th>
<th>Slice Register</th>
<th>Slice LUT</th>
<th>Latency</th>
<th>Throughput (Mbps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SHA1_tte</td>
<td>1</td>
<td>233.899</td>
<td>720</td>
<td>897</td>
<td>82</td>
<td>1460.44</td>
</tr>
<tr>
<td>SHA1_inner</td>
<td>1</td>
<td>232.918</td>
<td>681</td>
<td>950</td>
<td>82</td>
<td>1454.32</td>
</tr>
<tr>
<td>SHA1_4</td>
<td>4</td>
<td>430.024</td>
<td>974</td>
<td>1332</td>
<td>82</td>
<td>107410.09</td>
</tr>
<tr>
<td>SHA1_40</td>
<td>40</td>
<td>427.546</td>
<td>5953</td>
<td>6465</td>
<td>82</td>
<td>106782.22</td>
</tr>
<tr>
<td>SHA1_Unfold40</td>
<td>40</td>
<td>458.593</td>
<td>3448</td>
<td>11994</td>
<td>42</td>
<td>223618.68</td>
</tr>
</tbody>
</table>

V. CONCLUSION

In conclusion, the five SHA-1 designs such as SHA-1 iterative, SHA-1 inner pipelining, SHA-1 4 stages pipelining, SHA-1 40 stages pipelining and SHA-1 40 stages pipelining with unfolding factor 2 transformation were designed using both Altera Quartus II and Xilinx ISE design suite 14.7 and evaluated. Simulation results showed that SHA-1 40 stages pipelining with unfolding factor 2 transformations provided the higher speed and throughput of SHA-1 design. By using the techniques of pipelining and unfolding, the throughput increased up to 159.3 Gbps on Arria II GX and 223.6 Gbps on Virtex 5 XC5VLX50T. The maximum frequency of unfolding SHA-1 40 stages pipelining design was 308.17 MHz and 458.6 MHz on Arria II GX and Virtex 5 XC5VLX50T respectively. This design only used 9799 combinational ALUTs and 718 total registers on Arria II GX. For Virtex 5 XC5VLX50T, the usage of slice LUT and slice register was about 11994 and 3448 respectively. Hence, by using appropriate FPGA device, the high frequency and small area implementation can be obtained. This proposed SHA-1 design provided better performance compared with traditional designs. As a result, our proposed technique can improve the performance of SHA-1 design which is useful for security system design and other hash functions.

ACKNOWLEDGEMENTS

This project was supported by Universiti Malaysia Sarawak (UNIMAS).

REFERENCES