Views 
   PDF Download PDF Downloads: 216

 Open Access -   Download full article: 

Open CL Altera SDK v.14.0 vs. v. 13.1 Benchmarks Study

Abedalmuhdi Almomany* and Amin Jarrah

Computer Engineering Department, Yarmouk University, Irbid, Jordan

Corresponding Author E-mail: emomani@yu.edu.jo

DOI : http://dx.doi.org/10.13005/ojcst15.010203.03

Article Publishing History
Article Received on : 19 Oct 2022
Article Accepted on : 21 Nov 2022
Article Published : 25 Nov 2022
Plagiarism Check: Yes
Reviewed by: Dr. M Ravi
Second Review by: Dr. Parrull Nair
Final Approval by: Dr. Marcin Paprzycki
Article Metrics
ABSTRACT:

Altera SDK for OpenCL allows programmers to write a simple code in OpenCL and abstracts all Field programmable gate array (FPGA) design complexity. The kernels are synthesized to equivalent circuits using the FPGA hardware recourses: Adaptive logic modules (ALMs), DSPs and Memory blocks. In this study, we developed a set of fifteen different benchmarks, each of which has its own characteristics. Benchmarks include with/without loop unrolling, have/have not atomic operations, have one/multiple kernels per single file, and in addition to one/more of these characteristics are combined.  Altera OpenCL v14.0 adds more features compared with previous versions. A set of parameters chosen to compare the two OpenCL SDK versions: Logic utilization (in ALMs), total registers, RAM Blocks, total block memory bits, and clock frequency.

KEYWORDS: ALM; Altera; AOC; CPU; DE5; FPGA; GPU; OpenC

Copy the following to cite this article:

Almomany A, Jarrah A. Open CL Altera SDK v.14.0 vs. v. 13.1 Benchmarks Study. Orient.J. Comp. Sci. and Technol; 15(1,2,3).


Copy the following to cite this URL:

Almomany A, Jarrah A. Open CL Altera SDK v.14.0 vs. v. 13.1 Benchmarks Study. Orient.J. Comp. Sci. and Technol; 15(1,2,3). Available from: https://bit.ly/3Xt8SA9


Introduction

OpenCL stands for Open Computing Language, which is an open framework for parallel programming executed across heterogeneous platforms: CPUs, GPUs and DSPs [1]. OpenCL programming model consists of two programs; first, host program, which is usually written in C/C++, and it is responsible for loading the OpenCL programs, memory management, data transfer and errors checking [2]. Second program is the device code, which is written in OpenCL, and can be run on the available devices such as GPUs, DSPs, or FPGAs.

In OpenCL, kernel could be executed by a large number of work-items (threads). Work-items are organized in one, two or three dimensions, and are divided into blocks which can be multi-dimensions. Each block is called a workgroup. The size of a workgroup can be up to 1024 or 2048 work-items depending on device capability. All work-items inside the workgroup can be synchronized using barrier. However, synchronization cannot be between workgroups, and they could be executed in any order[7]. 

The Altera SDK for OpenCL allows the programmer to implement parallel algorithms on FPGA with a high level of hardware abstraction. The Altera offline compiler (AOC) is used to generate the Altera executable file, which can be run on the FPGA (DE5 in this study. Each kernel is synthesized to an equivalent circuit on the FPGA board, and each circuit contains a set of hardware recourses. FPGAs implement parallel algorithm using pipelining architecture where input data passes through a sequence of stages. [4,5] FPGA main resources include Adaptive logic modules (ALM), digital signal processing (DSP) and memory blocks.

AOC is used to create a hardware configuration file. Some parameters can be combined for the optimization purpose. Compilation process is very length, which can range anywhere between minutes and several days. In the set of benchmarks here, the compilation time ranges between one hour and few minutes up to six hours and few minutes.

The Altera SDK 14.0 has been developed to include new features, such as supporting hard floating points, channel extensions, supporting new types (float3) and other features [5].  Our motivation behind this study is to shows how these new features could affect the performance by compiling and running set of benchmarks.

FPGAs are widley used to improve the performance in several scientific applications [7-26].The FPGA device used here is Stratix V; ALM is developed to implement most of function efficiently. Each ALM contains a look-up table (eight inputs), two dedicated adders and four dedicated registers. The LUT can implement any 6-input logic functions and a number of 7-input functions. It can also be used as two separated LUTs for efficient using.  The block diagram for ALM is shown in figure-1. [6]

Expermintal Setup Environments

Linux  2.6.32-504.1.3.el6.x86_64

Altera SDK , 64-Bit Offline Compiler Ver. 14.0

Altera SDK , 64-Bit Offline Compiler Ver. 13.1

gcc version 4.4.7

DE5 Board (StratixV ,Dev 5SGXEA7N2F45C2)

Table 1:

Click here to View figure 

Experiment and Results Discussion

Several studies handel the issue of comparing different compilers[3,4]. To compare the two Altera SDK versions, a set of fifteen benchmarks were developed for comparison purpose. These benchmarks are varied in their characteristics as follows: none, one, or more atomic operations, with/without loop unrolling, single/multiple kernels per file.

The benchmarks written can be classified as pure memory access, where the whole kernel is written using reads or writes memory operations. The read/write operations could be atomic or non-atomic, using same or different atomic operation. “atomic add” and atomic exchange are used in this study. The other class is consisted of a set of arithmetic operations on floating points. These operations include four main operations (addition, subtraction, division, and multiplication) .The OpenCL kernels can repeat the same code many times, where loop unrolling is used in some kernels. The same kernels run again but without loop unrolling in other benchmarks. The last thing tested using theses benchmarks is repeating the same kernel in the file up to seven times, or using more than one kernel with different characteristics. In summary, a set of fifteen benchmarks summaries all of the above attributes.

A set of parameters are concerned here: logic utilization in ALMs, RAM blocks, total memory bits, clock frequency, total registers and compile time. Other parameters might be added here are size of configuration and backup files created. Our results show that the size of the files created by Altera 14.0 is less by 400MBs

The FPGA device used in the experiments contains 234,720 ALMs, 256 DSP Blocks, 52,428,800 block memory bits and 2,560 RAM Blocks.

Table I shows the results of all benchmarks for Altera 13.0. Logic utilization, RAM blocks and total memory bits are given in percent. Table II, shows similar benchmark runs for Altera 14.0.

Table 1: Altera 13.0 Benchmarks results

Click here to View table

Table 2: Altera 14.0 Benchmarks results

Click here to View table

Examining both tables, it is clear that the Altera SDK 14.0 shows better optimization of resources, and requires less compilation time. On the other hand, the clock frequency may not be enhanced, but may be decreased. Dividing the values in Table II by the corresponding values in Table I and averaging each row will generate the results shown in Table III. This gives a comparison between the two versions considering the parameters mentioned above. Taking the average effects of all parameters, every parameter is normalized to the Altera SDK 13.1 corresponding parameter.

Table 3: Comparison Results

Logic Utlization

Total Registers

Ram blocks

Total  block memory bits

Clock Frequencey

Compile time in minuites

Altera 13.1

1

1

1

1

1

1

Altera 14.0

0.9

0.86

0.74

0.96

0.99

0.89

Figure 2: Altera SDK 14.0 vs. Altera SDK 13.1 Comparison results

Click here to View figure

Conclusion

Our study shows that using the Altera SDK 14.0 for the previous benchmarks provides better recourses utilization. We need fewer resources compared to the Altera SDK 13.0.  Although the clock speed may decrease or increase, the changes is insignificant. We recommend using Altera SDK14.0 instead of Altera SDK 13.0. In future paper, the comparison will handle the most recent Intel FPGA compilers.

Conflict of Interest

This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue. The authors certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-fi nancial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.

Acknowledgement

None

Conflict of Interest

There is no conflict of interest.

Funding Source

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

  1. Almomany, A., Al-Omari, A. M., Jarrah, A., Tawalbeh, M., & Alqudah, A. (2020). An OpenCL-based parallel acceleration of a Sobel edge detection algorithm Using Intel FPGA technology. South African Computer Journal, 32(1), 3-26.
    CrossRef
  2. Abedalmuhdi, A., Wells, B. E., & Nishikawa, K. I. (2017, April). Efficient particle-grid space interpolation of an FPGA-accelerated particle-in-cell plasma simulation. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (pp. 76-79). IEEE
    CrossRef
  3. Almomany, A., Alquraan, A., & Balachandran, L. (2014). GCC vs. ICC comparison using PARSEC Benchmarks. International Journal of Innovative Technology and Exploring Engineering, 4(7).‏.
  4. Hiasat, R. H., & Almomani, A. A. (2013). Real Time Radio Frequency Identification Vehicles Data Logger Traffic Management System
  5. Almomany, A., Ayyad, W. R., & Jarrah, A. (2022). Optimized implementation of an improved KNN classification algorithm using Intel FPGA platform: Covid-19 case study. Journal of King Saud University-Computer and Information Sciences
    CrossRef
  6. http://www.altera.com/devices/fpga/stratix-fpgas/about/fpga-architecture/stx-architecture.html
  7. Almomany, A. M. (2017). Efficient openCL-based particle-in-cell simulation of auroral plasma phenomena within a commodity spatially reconfigurable computing environment
  8. Nishikawa, K., Almomany, A., & Wells, B. (2016, April). Two-dimensional PIC simulations of double layers in the upward current region of the aurora with quasi-dipole magnetic fields. In EGU General Assembly Conference Abstracts (pp. EPSC2016-3037)
  9. Jarrah, A., Almomany, A., Alsobeh, A. M., & Alqudah, E. (2021). High-performance implementation of wideband coherent signal-subspace (CSS)-Based DOA algorithm on FPGA. Journal of Circuits, Systems and Computers, 30(11), 2150196
    CrossRef
  10. Jarrah, A., Haymoor, Z. S., Al-Masri, H. M., & Almomany, A. (2022). High-Performance Implementation of Power Components on FPGA Platform. Journal of Electrical Engineering & Technology, 17(3), 1555-1571
    CrossRef
  11. Almomany, A., Sewell, S., Wells, B. E., & Nishikawa, K. I. (2017). A study of V-shaped potential formation using two-dimensional particle-in-cell simulations. Physics of Plasmas, 24(5), 052305.‏
    CrossRef
  12. Almomany, A., Jarrah, A., & Al Assaf, A. (2022). FCM Clustering Approach Optimization Using Parallel High-Speed Intel FPGA Technology. Journal of Electrical and Computer Engineering, 2022
    CrossRef
  13. Almomany, A., Al-Omari, A. M., Jarrah, A., & Tawalbeh, M. (2020). Discovering regulatory motifs of genetic networks using the indexing-tree based algorithm: a parallel implementation. Engineering Computations
    CrossRef
  14. Jarrah, A., Al-Tamimi, A. K., & Albashir, T. (2018). Optimized parallel implementation of extended Kalman filter using FPGA. Journal of Circuits, Systems and Computers, 27(01), 1850009
    CrossRef
  15. Al Bataineh, A., Kaur, D., & Jarrah, A. (2018, July). Enhancing the parallelization of backpropagation neural network algorithm for implementation on fpga platform. In NAECON 2018-IEEE National Aerospace and Electronics Conference (pp. 192-196). IEEE
    CrossRef
  16. Jarrah, A., Jamali, M. M., & Hosseini, S. S. S. (2014, June). Optimized FPGA based implementation of particle filter for tracking applications. In NAECON 2014-IEEE National Aerospace and Electronics Conference (pp. 233-236). IEEE
    CrossRef
  17. Alqudah, E., & Jarrah, A. (2020). Parallel implementation of genetic algorithm on FPGA using Vivado high level synthesis. International Journal of Bio-Inspired Computation, 15(2), 90-99.‏
    CrossRef
  18. Jarrah, A., & Jamali, M. M. (2014, November). Optimized FPGA based implementation of discrete wavelet transform. In 2014 48th Asilomar Conference on Signals, Systems and Computers (pp. 1839-1842). IEEE.‏
    CrossRef
  19. Jarrah, A., & Jamali, M. (2013). Software tool for efficient FPGA design of direct data domain approach for space‐time adaptive processing. Electronics letters49(13), 789-791.‏
    CrossRef
  20. Al Bataineh, A., & Jarrah, A. (2022). High performance implementation of neural networks learning using swarm optimization algorithms for EEG classification based on brain wave data. International Journal of Applied Metaheuristic Computing (IJAMC)13(1), 1-17.‏
    CrossRef
  21. Jarrah, A. A., & Jamali, M. M. (2016). FPGA based architecture of extensive cancellation algorithm (ECA) for passive bistatic radar (PBR). Microprocessors and Microsystems41, 56-66.‏
    CrossRef
  22. Jarrah, A., & Jamali, M. M. (2015). Reconfigurable FPGA/GPU-based architecture of block compressive sampling matching pursuit algorithm. Journal of Circuits, Systems and Computers24(04), 1550055.‏
    CrossRef
  23. Al Bataineh, A., Jarrah, A., & Kaur, D. (2019). High-speed FPGA-based of the particle swarm optimization using HLS tool. International Journal of Advanced Computer Science and Applications10(5)
    CrossRef
  24. Jarrah, A., Haddad, B., Al-Jarrah, M. A., & Obeidat, M. B. (2017). Optimized parallel architecture of evolutionary neural network for mass spectrometry data processing. International Journal of Modeling, Simulation, and Scientific Computing8(01), 1750016.‏
    CrossRef
  25. Jarrah, A., Jamali, M. M., Hosseini, S. S. S., Astola, J., & Gabbouj, M. (2015). Parralelization of non-linear & non-Gaussian Bayesian state estimators (Particle filters). In 2015 23rd European Signal Processing Conference (EUSIPCO) (pp. 2506-2510). IEEE.‏
    CrossRef
  26. Jarrah, A., & Jamali, M. M. (2013, November). Software tool for FPGA based MIMO radar applications. In 2013 Asilomar Conference on Signals, Systems and Computers (pp. 1792-1795). IEEE
    CrossRef

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.