Pipeline performance optimization of AES algorithm

2022-08-14

Pipeline performance optimization of AES algorithm on reconfigurable platform

Abstract: The AES Rijndael algorithm is the new-generation block cipher standard that replaced DES in the United States, and it is also a de facto international standard. This paper studies pipeline performance optimization of the AES algorithm with a 128-bit key on a reconfigurable platform. By discussing and implementing basic operation optimization, loop unrolling, intra-round pipelining, inter-round pipelining and a hybrid multistage pipeline structure, we compare the advantages, disadvantages and applicable environments of the different optimization methods. Experiments show that encryption performance differs greatly between structures. The hybrid multistage pipeline structure reaches an encryption rate of 27.1 Gbit/s, which compares well with related results reported at home and abroad.

Keywords: Advanced Encryption Standard; reconfigurable computing; pipeline; structure optimization; AES Rijndael algorithm

1 Overall structure of the AES Rijndael algorithm

The AES Rijndael algorithm is a block cipher with a substitution-permutation network structure. Its design is based on polynomial operations over finite fields. The cipher consists of four transformations: SubBytes, which performs the nonlinear S-box substitution; ShiftRows, which cyclically shifts the rows of the state matrix; MixColumns, which performs a matrix multiplication over the finite field GF(2^8); and AddRoundKey, which combines the round key with the state matrix through a simple XOR operation. The algorithm encrypts a 128-bit plaintext block into a 128-bit ciphertext block through Nr rounds of transformation, where Nr is a constant determined by the key length. For a 128-bit key, Nr is 10. Every round is identical except the last, in which the MixColumns transformation is omitted to resist certain special forms of cryptanalysis.
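The finite-field arithmetic behind MixColumns can be illustrated with a minimal software sketch. It assumes the standard AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) and the standard MixColumns matrix; the function names are illustrative and are not taken from the paper, whose implementation is in hardware.

```python
def xtime(b: int) -> int:
    """Multiply a GF(2^8) element by x (0x02), reducing by the AES polynomial 0x11B."""
    b <<= 1
    if b & 0x100:          # degree-8 term appeared, so reduce
        b ^= 0x11B
    return b & 0xFF


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements by shift-and-add (XOR plays the role of addition)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def mix_single_column(col):
    """Apply the MixColumns matrix to one 4-byte column of the state."""
    a0, a1, a2, a3 = col
    return [
        gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3,
        a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3,
        a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3),
        gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2),
    ]


column = mix_single_column([0xDB, 0x13, 0x53, 0x45])
print([hex(v) for v in column])
```

Because the only constants involved are 1, 2 and 3, each output byte reduces to a few XORs and conditional shifts, which is why this step maps to compact combinational logic on a reconfigurable platform.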

2 Loop unrolling and pipeline optimization

For the AES encryption process described above, shown in Figure 1 (a), we optimized the implementation of the four transformation functions on the reconfigurable platform, and the optimized encryption process reached a clock frequency of 127.9 MHz. Encrypting a 128-bit plaintext block takes 11 clock cycles, so the encryption rate is 1.49 Gbit/s. This rate meets the needs of most applications. To serve higher-speed applications, however, the design can be optimized further by changing the architecture of the encryption process. The simplest way to raise the encryption rate is loop unrolling: the iterative structure is expanded and the inputs and outputs of multiple round-transformation circuits are connected end to end, as shown in Figure 1 (b). This removes the register setup delay and the propagation delay of the multiplexer, which speeds up the encryption transformation. However, the method consumes a large amount of logic resources for little gain in performance. In our experiments, the loop-unrolled design is 17% faster than the iterative structure but consumes 6 times the logic resources. The efficiency of this optimization is therefore extremely low; it is worth adopting only when performance requirements are high and resources are plentiful.
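The rate quoted for the iterative structure follows directly from the clock frequency, the block size and the cycle count. A minimal sketch of that arithmetic is given below; the function name is illustrative, and the result is expressed in bits per second, since 127.9 MHz × 128 bits / 11 cycles gives roughly 1.49 × 10^9 bits per second.

```python
def throughput_bits_per_s(clock_hz: float,
                          block_bits: int = 128,
                          cycles_per_block: int = 11) -> float:
    """Encryption rate of a design that finishes one block every `cycles_per_block` cycles."""
    return clock_hz * block_bits / cycles_per_block


iterative = throughput_bits_per_s(127.9e6)
print(f"iterative structure: {iterative / 1e9:.2f} Gbit/s")
```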

Another optimization method is pipelining. It divides the critical execution path into several short stages and inserts registers between the stage circuits to store the results of the previous stage. Although the total length of the critical execution path is not reduced, the circuit can then work on several data blocks at once, which raises the degree of concurrency and so greatly improves the encryption rate. The most commonly used technique when designing a pipeline for an encryption algorithm is inter-round pipelining. It divides the loop-unrolled structure into a series of pipeline stages along the round boundaries, so that each encryption round becomes one stage. Registers are inserted between the stages and driven by a synchronous clock. On each clock edge, a register stores the latest transformation result and passes the previously stored result to the next processing unit, whose output is stored in the next-stage register, as shown in Figure 1 (c). According to our experimental results (see Figure 3), the encryption rate of the inter-round pipeline design is 12 times that of the iterative structure; resource consumption also rises significantly, to 7 times that of the iterative structure. The results show that inter-round pipelining is an effective optimization for block ciphers. It is particularly suitable when the round function is relatively simple; for algorithms with complex round functions and few rounds, its benefit is less pronounced.
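The register behavior of an inter-round pipeline can be mimicked with a small software model. This is only a sketch: the `toy_round` function is a hypothetical stand-in and is not AES, and `run_pipeline` is an illustrative name. What it shows is that N blocks pass through S register-separated stages in S + N - 1 clock cycles, so that once the pipeline is full one block completes per cycle.

```python
def run_pipeline(blocks, stages):
    """Clock `blocks` through register-separated `stages`; return (outputs, cycles)."""
    regs = [None] * len(stages)      # regs[i] holds the output of stage i
    pending = list(blocks)
    outputs = []
    cycles = 0
    while len(outputs) < len(blocks):
        cycles += 1
        # Update from the last stage back to the first, so every stage reads
        # the value its upstream register held before this clock edge.
        for i in reversed(range(len(stages))):
            if i == 0:
                src = pending.pop(0) if pending else None
            else:
                src = regs[i - 1]
            regs[i] = stages[i](src) if src is not None else None
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs, cycles


def make_toy_round(round_key):
    """Hypothetical stand-in for one encryption round (NOT the AES round function)."""
    def toy_round(state):
        return ((state * 5) ^ round_key) & 0xFFFFFFFF
    return toy_round


stages = [make_toy_round(k) for k in range(1, 11)]   # 10 rounds, one per stage
blocks = [10, 20, 30, 40]
outputs, cycles = run_pipeline(blocks, stages)
print(f"{len(blocks)} blocks finished in {cycles} cycles")
```

An iterative design with the same round count would need about 10 cycles per block, so the gain grows with the number of blocks kept in flight.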

Chodowiec et al. proposed intra-round pipelining to optimize algorithms of this kind, whose encryption round is complex. The intra-round pipeline divides a single encryption round into several stages and inserts registers between them, as shown in Figure 2 (a). The advantage of this method is that the added resource consumption is very small, since only the additional stage registers are needed. It also has a shortcoming: the stage delays of an intra-round pipeline are difficult to balance, and the overall clock frequency is determined by the delay of the slowest stage. In our experiment, the AES round function is divided into a 4-stage pipeline along its constituent modules. A finer division is possible but harder, because long structures such as the S-box are difficult to split further, and their delay then determines the overall clock frequency.
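The effect of unbalanced stages on the clock frequency can be sketched numerically. The stage delays below are hypothetical values chosen only for illustration; they are not measurements from the paper, and `max_clock_mhz` is an illustrative helper.

```python
def max_clock_mhz(stage_delays_ns, register_overhead_ns: float = 0.0) -> float:
    """Highest clock frequency a pipeline can run at: limited by its slowest stage."""
    return 1000.0 / (max(stage_delays_ns) + register_overhead_ns)


# Hypothetical per-module delays in nanoseconds (assumed, not from the paper).
hypothetical_delays = {
    "SubBytes": 3.0,
    "ShiftRows": 0.5,
    "MixColumns": 2.0,
    "AddRoundKey": 1.0,
}

unbalanced = max_clock_mhz(hypothetical_delays.values())

# The same total delay split into four perfectly balanced stages.
total = sum(hypothetical_delays.values())
balanced = max_clock_mhz([total / 4] * 4)

print(f"split along module boundaries: {unbalanced:.1f} MHz")
print(f"ideal balanced split:          {balanced:.1f} MHz")
```

With these assumed numbers the S-box stage alone sets the clock, which is the situation the text describes: splitting the other modules further brings no benefit until the longest stage itself is divided.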

According to the experimental results shown in Figure 3, the execution efficiency of the intra-round pipeline structure is five times higher than that of the iterative structure, while the required resources are 11% less. Since the intra-round pipeline structure adds stage registers, its resource requirement would be expected to increase, yet the actual synthesis result shows a decrease. We therefore analyzed the synthesis reports of the two structures in detail. The reported data suggest that optimization of the design by the logic synthesis software reduced the resources required by the intra-round pipeline structure.

To achieve extremely high encryption speed, we combined the intra-round and inter-round pipelines into a hybrid intra-round and inter-round pipeline structure. The hybrid structure has a very short single-stage delay, so the clock frequency can be raised to 212.5 MHz. It also completes the encryption of one data block in every clock cycle, so the encryption speed reaches 27.1 Gbit/s. This is higher than the speeds currently reported for high-speed AES encryption chip implementations. Achieving such a speed requires considerable resources: the logic synthesis results show that 17887 logic units are needed for this design, as shown in Figure 4, which is equivalent to the capacity of four Xilinx XC2V1000 FPGAs. We also evaluated the efficiency of the various implementation structures, using the rate-resource ratio, that is, the number of Mbit encrypted per second divided by the number of logic units required by the design, as the efficiency of a structure. As Figure 5 shows, the intra-round pipeline structure is the most efficient design, with a ratio of 3.49; the loop-unrolled structure is the least efficient, at only 0.12. When logic resources are relatively limited, the intra-round pipeline is therefore the more appropriate choice.
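The hybrid structure's figures can be checked with the same kind of arithmetic. The sketch below uses only numbers quoted in the text (212.5 MHz, one 128-bit block per cycle, 27.1 Gbit/s, 17887 logic units); the resulting rate-resource ratio for the hybrid structure is derived here from those numbers and is not a value reported by the authors. The function name is illustrative.

```python
def rate_resource_ratio(rate_mbit_per_s: float, logic_units: int) -> float:
    """Efficiency metric from the text: Mbit encrypted per second per logic unit."""
    return rate_mbit_per_s / logic_units


# One 128-bit block per clock cycle at 212.5 MHz, expressed in Mbit/s.
hybrid_peak_mbit = 212.5e6 * 128 / 1e6

# Ratio for the hybrid structure, using the reported 27.1 Gbit/s and 17887 logic units.
hybrid_ratio = rate_resource_ratio(27.1e3, 17887)

print(f"peak rate: {hybrid_peak_mbit / 1e3:.1f} Gbit/s")
print(f"rate-resource ratio: {hybrid_ratio:.2f} Mbit/s per logic unit")
```

Under this derived figure the hybrid structure falls between the two extremes cited from Figure 5 (3.49 and 0.12), consistent with the text's point that the highest raw speed is not the most resource-efficient choice.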

3 Conclusion

To sum up, beyond optimizing the basic transformations of the AES encryption algorithm, the overall implementation structure of the algorithm has a very important impact on its encryption performance. In environments that do not require high throughput, the iterative structure is the better fit because it is simple to implement and requires the fewest resources. Where higher encryption throughput is needed at low cost, the intra-round pipeline structure is a reasonable compromise. Only when abundant resources are available and the highest encryption performance is pursued is it necessary to adopt the hybrid multistage pipeline structure.

Copyright © 2011 JIN SHI