• Design and Scheduling of a Convolutional Neural Network Accelerator for Cloud FPGAs (面向云端FPGA的卷积神经网络加速器的设计及其调度)

    Subjects: Computer Science >> Integration Theory of Computer Science · Submitted: 2018-11-29 · Cooperative journal: 《计算机应用研究》 (Application Research of Computers)

    Abstract: The high computational complexity of convolutional neural networks (CNNs) often obstructs their widespread adoption in real-time and low-power applications. Existing software implementations cannot meet CNNs' demands for computing performance and power consumption, while traditional FPGA-oriented CNN construction methods suffer from complicated workflows, long development cycles, and limited optimization space. To address these problems, and guided by the computational pattern of CNNs, this paper proposed a design and scheduling mechanism for a CNN accelerator targeting cloud FPGAs. Drawing on HLS-based design techniques, it introduced loop tiling parameters and reordered the convolution-layer loops, then constructed the network in a modular way and extended the parameters to further optimize the accelerator's processing pipeline. By analyzing the characteristics of system tasks and resources, it derived a scheduling scheme and optimized the design in terms of both control flow and data flow. Compared with existing work, the proposed design offers flexibility, low energy consumption, and high energy efficiency and performance; an efficient, general scheduling scheme for the accelerator is also discussed. Experimental results show that, compared with a CPU implementation, this design achieves an 8.84x speedup on AlexNet, while the power consumption of the Cifar implementation is only 24.96% of the CPU's. Compared with a CPU+GPU implementation, it achieves a 6.90x speedup on Cifar; although its performance on large-scale networks is inferior to the GPU's, its power consumption is as low as 14.98% of the GPU's. The design achieves up to a 6.29x speedup over existing research results, and compared with accelerators generated for large platforms it attains comparable performance at a lower clock frequency.
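The loop tiling and loop reordering that the abstract describes can be sketched in C as follows. This is a minimal illustration of the general technique, not the paper's actual accelerator code; all dimensions and tile sizes (N, M, R, C, K, TN, TR) are hypothetical, and on an FPGA the inner tile loops would additionally carry HLS pipelining/unrolling pragmas and read from on-chip buffers.

```c
#include <string.h>

/* Hypothetical dimensions for illustration only (not from the paper). */
#define N  8   /* output feature maps */
#define M  4   /* input feature maps  */
#define R  6   /* output rows         */
#define C  6   /* output cols         */
#define K  3   /* kernel size         */
#define TN 4   /* tile size over output maps (a tiling parameter) */
#define TR 3   /* tile size over output rows (a tiling parameter) */

/* Convolution layer with the output-map and output-row loops tiled, so
 * that each (TN x TR x C) tile of outputs can be kept in on-chip buffers
 * while the inner loops accumulate partial sums. */
void conv_tiled(float in[M][R + K - 1][C + K - 1],
                float w[N][M][K][K],
                float out[N][R][C])
{
    memset(out, 0, sizeof(float) * N * R * C);
    for (int to = 0; to < N; to += TN)        /* tile over output maps */
        for (int ro = 0; ro < R; ro += TR)    /* tile over output rows */
            /* Inner loops cover one tile; in an HLS flow these are the
             * loops that get pipelined and unrolled. */
            for (int n = to; n < to + TN && n < N; n++)
                for (int r = ro; r < ro + TR && r < R; r++)
                    for (int c = 0; c < C; c++)
                        for (int m = 0; m < M; m++)
                            for (int i = 0; i < K; i++)
                                for (int j = 0; j < K; j++)
                                    out[n][r][c] += w[n][m][i][j] *
                                                    in[m][r + i][c + j];
}
```

Exposing TN and TR as parameters is what lets a design-space exploration pick tile sizes that balance on-chip memory use against parallelism, which is the optimization space the abstract refers to.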