The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied for many years in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration. The algorithm possesses good global optimality, convergence stability, and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on an Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and an NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested on the Griewank benchmark function. Comparison results indicate that the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results, albeit with the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.

## INTRODUCTION

Global optimization is a hot but difficult issue in scientific research and engineering applications, such as hydrological model parameter optimization. It has been widely recognized that achieving a stable global optimum for highly complex benchmarks and real-world applications is still challenging. Issues such as local minima, high dimensionality, and premature convergence have become major obstacles to the appropriate solution of optimization problems. To overcome these obstacles, many optimization methods and algorithms have been proposed, including grid search, Monte Carlo random search, the genetic algorithm, particle swarm optimization, and SCE-UA (shuffled complex evolution optimization developed at the University of Arizona). A huge number of optimization methods have been developed and applied in various kinds of scientific research and engineering applications; however, only a small number of methods perform satisfactorily in highly complex and nonlinear real-world applications such as hydrological model parameter optimization. Among these methods and algorithms, SCE-UA has been widely accepted and used in scientific and engineering optimization applications, especially for hydrological model parameter global calibration (Duan 1991; Li *et al.* 2014, 2016; Dong *et al.* 2015; Kan *et al.* 2015a, 2015b, 2016a, 2016b, 2016c, 2016d, 2016e; Zuo *et al.* 2016). It features good global optimality, stable robustness, and excellent consistency. Therefore, the SCE-UA method has been recognized as an effective and robust tool for overcoming the above-mentioned obstacles, to some extent.

Even though the SCE-UA has achieved great success in many fields, it still faces a severe computational efficiency problem when the optimization problem is highly complex, because large numbers of objective function evaluations must be carried out during the evolution process. Taking hydrological model parameter optimization as an example, the hydrological model must be run repeatedly to obtain the objective function values in each iteration, so the computational burden is very high: an optimization task can take days or even weeks to finish, which is not acceptable. The SCE-UA method is usually implemented with serial programming and computing technology, such as serial Fortran, C/C++, and MATLAB. For complex optimization problems such as automatic hydrological model parameter calibration, the computational load is very high and the execution speed of the serial code is too slow. The computational efficiency of the SCE-UA therefore needs to be improved, and researchers have made efforts in two main directions: improvement of the algorithm itself (Wang *et al.* 2014; Gong *et al.* 2015, 2016; Zhang *et al.* 2017) and acceleration by parallel computing on new-generation multi-core CPUs (central processing units) and many-core GPUs (graphics processing units) (Sharma *et al.* 2006; Muttil *et al.* 2007; Kan *et al.* 2016a, 2016b, 2016c, 2016d, 2016e). Improving the algorithm itself requires introducing an artificial neural network or other data-driven surrogate model to mimic the objective function response surface. The choice of surrogate model may affect the optimization accuracy, and this operation risks deteriorating it. Moreover, the efficiency gain obtainable from surrogate models has an upper limit determined by the specific surrogate model selected: to improve efficiency further, users must select simpler surrogate models, which may again degrade optimization accuracy. Acceleration on highly parallel computing hardware, such as multi-core CPUs and many-core GPUs, does not face these problems. It is based on algorithm parallelization and parallel programming technology, and it can achieve a satisfactory speedup ratio without loss of optimization accuracy. The only cost is implementing the parallel algorithm with new programming techniques such as OpenMP and CUDA.

The acceleration of the SCE-UA on multi-core CPUs and many-core GPUs based on OpenMP and CUDA has been proposed and tested in previous literature. However, several popular parallel programming techniques, such as OpenMP, OpenCL, CUDA, and OpenACC, can be applied to implement the parallel SCE-UA, and the performance and code complexity of the resulting implementations may vary significantly. To enable effective and efficient use of the parallel SCE-UA, it is of great importance to compare them and give useful guidelines for the selection and utilization of the different parallel programming techniques. Unfortunately, such in-depth investigations of the parallel SCE-UA method, covering implementations with different popular parallel programming techniques (OpenMP, OpenCL, CUDA, and OpenACC) and performance comparisons among them, have not been carried out. To fill this gap and compare the algorithm performance and code complexity of different implementations, we propose a parallel SCE-UA method and implement it on an Intel Xeon multi-core CPU and an NVIDIA Tesla many-core GPU by using OpenMP, OpenCL, CUDA (NVIDIA Corporation 2015), and OpenACC. Performance tests of the serial and parallel SCE-UA were carried out on the Griewank benchmark function to assess execution speed, optimization accuracy, robustness, and consistency. The source codes of the OpenMP, OpenCL, CUDA, and OpenACC implementations were also analyzed to study the relationship between speedup ratio and source code complexity. Finally, some useful guidelines are given to help potential users of the parallel SCE-UA achieve better acceleration results.

## SERIAL SCE-UA

The original serial SCE-UA is based on a synthesis of four concepts: (1) combination of deterministic and probabilistic approaches; (2) systematic evolution of a ‘complex’ of points spanning the parameter space, in the direction of global improvement; (3) competitive evolution (Holland 1975); and (4) complex shuffling (Duan *et al.* 1992, 1993; Sorooshian *et al.* 1993). A general description of the steps of the SCE-UA is given below (a more detailed presentation can be found in Duan *et al.* (1992, 1993)).

- (1) Generate sample. Sample *s* points randomly in the feasible parameter space and compute the criterion value at each point. In the absence of prior information on the approximate location of the global optimum, use a uniform probability distribution.
- (2) Rank points. Sort the *s* points in order of increasing criterion value so that the first point represents the smallest criterion value (assuming that the goal is to minimize the criterion value).
- (3) Partition into complexes. Partition the *s* points into *p* complexes, each containing *m* points. The complexes are partitioned such that the first complex contains every *p*(*k* – 1) + 1 ranked point, the second complex contains every *p*(*k* – 1) + 2 ranked point, and so on, where *k* = 1, 2, …, *m*.
- (4) Evolve each complex. Evolve each complex according to the competitive complex evolution (CCE) algorithm (which is elaborated upon below).
- (5) Shuffle complexes. Combine the points in the evolved complexes into a single sample population; sort the sample population in order of increasing criterion value; re-partition the sample population into *p* complexes according to the procedure specified in Step 3.
- (6) Check convergence. If any of the convergence criteria (stated below) are satisfied, stop; otherwise, return to Step 4.
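As a concrete illustration of the partitioning rule in Step 3, the ranked points are dealt into complexes round-robin by rank, so every complex spans the whole quality range. A minimal C sketch, using 0-based indices (the function name is ours, not from the original implementation):

```c
#include <stddef.h>

/* Deal s = p*m ranked points into p complexes of m points each.
 * Point indices are ranks (0 = best). Complex j receives ranks
 * j, j + p, j + 2p, ..., matching the paper's 1-based rule that the
 * first complex holds every p(k-1)+1 ranked point.
 * `complexes` is a p x m array stored row-major. */
void partition_into_complexes(int p, int m, int *complexes)
{
    for (int j = 0; j < p; ++j)              /* complex index */
        for (int k = 0; k < m; ++k)          /* slot within the complex */
            complexes[j * m + k] = j + k * p; /* rank assigned to this slot */
}
```

For example, with *p* = 2 and *m* = 3 the first complex receives ranks {1, 3, 5} and the second {2, 4, 6} in the paper's 1-based numbering.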

The SCE-UA method iterates to converge towards the global optimal point. Therefore, it should be stopped by using some termination criteria to indicate the arrival of the optimal point. The criteria discussed below were used in this research.

- (a) Objective function convergence: the search is stopped when the algorithm is unable to further improve the objective function value over a pre-specified number of shuffling loops (*K_stop*). This is implemented as |*f_i* – *f_{i–1}*| / |*f_i*| < *Tolerance_obj*, where *f_i* and *f_{i–1}* denote the objective function values at the *i*th and (*i* – 1)th shuffling loop, respectively, and *Tolerance_obj* denotes the objective function convergence criterion value.

- (b) Parameter convergence: when the algorithm cannot significantly change the parameter values over one or more shuffling loops, the search is stopped. The parameter convergence criterion is max_*j* |*x_{i,j}* – *x_{i–1,j}*| / (*x_j,max* – *x_j,min*) < *Tolerance_param*, where *x_{i,j}* and *x_{i–1,j}* denote the *j*th parameter values at the *i*th and (*i* – 1)th shuffling loop, respectively; *x_j,max* and *x_j,min* denote the *j*th parameter's maximum and minimum values, respectively; and *Tolerance_param* denotes the parameter convergence criterion value.

- (c) Maximum objective function evaluations: this criterion serves as a backup to prevent wasting computing resources. If the number of objective function evaluations exceeds the criterion value, the search is stopped.

The CCE algorithm, based on the Nelder & Mead (1965) simplex downhill search scheme, is presented briefly as follows.

(I) Construct a subcomplex by randomly selecting *q* points from the complex according to a trapezoidal probability distribution. The probability distribution is specified such that the best point has the highest chance of being chosen to form the subcomplex, and the worst point has the least chance.

(II) Identify the worst point of the subcomplex and compute the centroid of the subcomplex without including the worst point.

(III) Attempt a reflection step by reflecting the worst point through the centroid. If the newly generated point is within the feasible space, go to Step IV; otherwise, randomly generate a point within the feasible space and go to Step VI.

(IV) If the newly generated point is better than the worst point, replace the worst point by the new point. Go to Step VII. Otherwise, go to Step V.

(V) Attempt a contraction step by computing a point halfway between the centroid and the worst point. If the contraction point is better than the worst point, replace the worst point by the contraction point and go to Step VII. Otherwise, go to Step VI.

(VI) Randomly generate a point within the feasible space. Replace the worst point by the randomly generated point.

(VII) Repeat Steps II–VI *α* times, where *α* ≥ 1 is the number of consecutive offspring generated by the same subcomplex.

(VIII) Repeat Steps I–VII *β* times, where *β* ≥ 1 is the number of evolution steps taken by each complex before complexes are shuffled (Duan *et al.* 1994).
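The geometric moves in Steps III and V are simple vector operations on the worst point and the centroid. A minimal C sketch (helper names are ours):

```c
/* Step III: reflect the worst point through the centroid,
 * out = 2*centroid - worst, in nopt dimensions. */
void reflect_point(const double *centroid, const double *worst,
                   double *out, int nopt)
{
    for (int i = 0; i < nopt; ++i)
        out[i] = 2.0 * centroid[i] - worst[i];
}

/* Step V: contract to the point halfway between centroid and worst,
 * out = (centroid + worst) / 2. */
void contract_point(const double *centroid, const double *worst,
                    double *out, int nopt)
{
    for (int i = 0; i < nopt; ++i)
        out[i] = 0.5 * (centroid[i] + worst[i]);
}
```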

## PARALLEL SCE-UA

- (1) Generate sample in parallel. Create *s* parallel threads; in each thread, sample a point randomly in the feasible parameter space and compute the criterion value at this point. In the absence of prior information, use a uniform probability distribution. For the OpenMP and OpenACC implementations, this step is a for loop decorated with the ‘#pragma omp parallel for’ OpenMP directive or the ‘#pragma acc kernels loop’ OpenACC directive, respectively. For the OpenCL and CUDA implementations, this step is implemented as an OpenCL kernel or a CUDA kernel, respectively.
- (2) Rank points. Sort the *s* points in parallel in order of increasing criterion value. For the OpenMP and OpenACC implementations, this step uses hand-written parallel radix sort C/C++ code. For the OpenCL implementation, it uses the open source Boost.Compute library; for the CUDA implementation, the open source NVIDIA Thrust library.
- (3) Partition into complexes. Partition the *s* points into *p* complexes, each containing *m* points. The complexes are partitioned such that the first complex contains every *p*(*k* – 1) + 1 ranked point, the second complex contains every *p*(*k* – 1) + 2 ranked point, and so on, where *k* = 1, 2, …, *m*. This step runs in serial.
- (4) Evolve each complex in parallel. Create *p* parallel threads; in each thread, evolve one complex according to the CCE algorithm described above. As in Step 1, this is a directive-decorated for loop in the OpenMP and OpenACC implementations and a kernel in the OpenCL and CUDA implementations.
- (5) Shuffle complexes. Combine the points in the evolved complexes into a single sample population (in serial); sort the sample population in order of increasing criterion value (in parallel, as in Step 2); re-partition the sample population into *p* complexes according to the procedure specified in Step 3 (in serial).
- (6) Check convergence. If any of the pre-specified convergence criteria are satisfied, stop; otherwise, return to Step 4. This step runs in serial.
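To illustrate why the directive-based implementations stay close to the serial code, Step 4 in the OpenMP variant reduces to a single directive over the complex loop, since the *p* complexes are independent between shuffles. A minimal, self-contained C sketch (the evolution body here is a stand-in that merely scales the points, not the actual CCE code):

```c
#include <stddef.h>

/* Stand-in for the CCE evolution of one complex; a real implementation
 * would run the subcomplex selection and simplex steps described above.
 * Here it just halves every coordinate so the parallel structure can be
 * demonstrated and tested deterministically. */
static void evolve_complex(double *pts, int m, int nopt)
{
    for (int i = 0; i < m * nopt; ++i)
        pts[i] *= 0.5;
}

/* Step 4, OpenMP flavor: one loop iteration (one complex) per thread.
 * `complexes` holds p complexes of m points with nopt coordinates each,
 * stored contiguously. Without -fopenmp the pragma is ignored and the
 * loop simply runs serially. */
void evolve_all_complexes(double *complexes, int p, int m, int nopt)
{
    #pragma omp parallel for
    for (int j = 0; j < p; ++j)
        evolve_complex(&complexes[(size_t)j * m * nopt], m, nopt);
}
```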

## HARDWARE AND SOFTWARE USED IN THIS STUDY

In this study, the performance tests were carried out on the Intel Xeon E5-2640v2 CPU and the NVIDIA Tesla K40c GPU. The CPU and GPU were hosted by an HP Z820 workstation. The hardware used in this study is listed in Table 1.

**Table 1** | Hardware used in this study

| Computer name | Device name | Device description |
|---|---|---|
| HP Z820 | CPU: Intel Xeon E5-2640v2 | 8 physical cores (16 logical cores), 2.0 GHz, 0.032 TFLOPS single precision compute capability |
| | GPU1: NVIDIA Tesla K40c | 2,880 CUDA cores, 875 MHz with boost on, 12 GB ECC memory, 4.29 TFLOPS single precision compute capability |
| | GPU2: NVIDIA Quadro K6000 | 2,880 CUDA cores, 12 GB ECC memory, used for display |
| | System memory: Samsung DDR3 | 32 GB ECC memory |
| | Hard drive disk: Western Digital | 1 TB SATA, 7,200 rpm |


This study was based on the Microsoft Windows 7 platform. The source code was developed under the Microsoft Visual Studio 2010 Integrated Development Environment (IDE). The serial and OpenMP codes were built with the Microsoft Visual C++ 2010 compiler with OpenMP support enabled. The OpenCL and CUDA codes were built with the NVIDIA NVCC compiler, OpenCL 1.2, and CUDA 6.5. The OpenACC code was built with the PGI Accelerator workstation C/C++ 15.10 compiler with OpenACC support (this compiler runs under Cygwin via the command shell).

## RESULTS AND DISCUSSION

The serial and parallel SCE-UA methods were tested on the Griewank benchmark function. The performances were compared and the sensitivity was analyzed based on the testing results. Because the Griewank function itself is very fast to evaluate and cannot mimic the heavy computational burden of a complex problem's objective function, such as a hydrological model simulation, we appended four kinds of arithmetic operations to each objective function evaluation: *nobj* additions, *nobj* subtractions, *nobj* multiplications, and *nobj* divisions of a dummy temporary variable. For the GPU versions of the SCE-UA, the time consumed by memory transfer is included in the total execution time to ensure a fair comparison between the CPU and GPU versions.

The Griewank benchmark function is defined as *f*(*x*) = Σ *x_i*²/4000 – Π cos(*x_i*/√*i*) + 1, with the summation and product taken over *i* = 1, 2, …, *nopt*, where *x_i* denotes the *i*th decision variable and *nopt* denotes the number of decision variables.

- (1)
Comparison of execution time of serial, OpenMP, OpenCL, CUDA, and OpenACC SCE-UA.

The execution times of the different SCE-UA implementations are shown in Figure 2. The execution time of the serial SCE-UA varied from 93.7 s to 90,233.0 s; the OpenMP SCE-UA from 23.2 s to 6,028.3 s; the OpenCL CPU SCE-UA from 12.5 s to 773.3 s; the OpenCL GPU SCE-UA from 92.2 s to 634.0 s; the CUDA SCE-UA from 93.5 s to 645.0 s; and the OpenACC SCE-UA from 94.4 s to 627.4 s. We can see that the serial SCE-UA ran very slowly, the OpenMP version ran faster than the serial version, and the OpenCL CPU version ran faster than the OpenMP version. The GPU versions (OpenCL GPU, CUDA, and OpenACC) ran very fast when the number of complexes was large, but slower than the serial version when the number of complexes was small. As the number of complexes increased, the CPU versions (serial, OpenMP, and OpenCL CPU) consumed more and more time, whereas the execution time of the GPU versions did not change significantly.

- (2)
Comparison of speedup ratio of OpenMP, OpenCL, CUDA, and OpenACC SCE-UA versus serial SCE-UA.

The speedup ratios of the different SCE-UA implementations relative to the serial version are shown in Figure 3. The speedup ratio of the OpenMP SCE-UA varied from 3.59× to 15.80×; the OpenCL CPU SCE-UA from 6.65× to 136.84×; the OpenCL GPU SCE-UA from 0.86× to 230.60×; the CUDA SCE-UA from 0.84× to 213.59×; and the OpenACC SCE-UA from 0.83× to 198.61×. We can see that, as the number of complexes (i.e., the number of parallel threads) increased, the parallel SCE-UA ran increasingly faster than the serial version. For small numbers of complexes, the OpenMP and OpenCL CPU versions were faster; for large numbers of complexes, the GPU versions were faster. The OpenCL GPU SCE-UA ran faster than the CUDA SCE-UA, which in turn ran faster than the OpenACC SCE-UA.

- (3)
Analysis of the impact of objective function computational load.

Here we fixed *nopt* at 20 (because the SCE-UA is mostly applied to hydrological model parameter calibration and most hydrological models have fewer than 20 parameters) and varied *p* (the number of complexes) and *nobj* (reflecting the objective function computational load) to test the impact of the objective function computational load on the speedup ratio. Figure 4 shows that the speedup ratio increased with increasing *nobj*; that is, heavier objective functions yield larger speedup ratios. With increasing *p*, the GPU versions ran increasingly faster, indicating that the GPU versions excel for large *p* whereas the CPU versions are preferable for small *p*.

- (4)
Analysis of impact of error correction code (ECC) setting for GPU device.

The Tesla GPU used in this study supported ECC. We tested the impact of the ECC setting on the execution time. The execution time ratio of ECC on versus ECC off is shown in Figure 5. For the OpenCL GPU SCE-UA, the numbers of ratios larger than 1, smaller than 1, and equal to 1 were 17, 11, and 17, respectively; for the CUDA SCE-UA they were 14, 17, and 14, respectively; and for the OpenACC SCE-UA they were 36, 9, and 0, respectively. These results indicate that, in terms of average and overall performance, the OpenCL GPU version runs slower with ECC on, the CUDA version runs faster with ECC on, and the OpenACC version runs slower with ECC on. The OpenACC version is the most sensitive to enabling ECC.

- (5)
Comparison of source code complexity.

The source code length was analyzed here. The lengths of the serial, OpenMP, OpenCL CPU, OpenCL GPU, CUDA, and OpenACC implementations were 941, 941, 1,972, 1,972, 1,402, and 1,005 lines, respectively. Considering the acceleration results above, although the OpenCL versions obtained the highest overall speedup ratio, they were also the most complex implementations. The CUDA version was moderately complex, while the complexity of the OpenMP and OpenACC versions was almost the same as that of the original serial version.

- (6)
Comparison of optimization accuracy.

We examined all the optimization runs and found that every run converged to the theoretical global optimum (i.e., the origin). This result indicates that the optimization accuracy of the parallel SCE-UA is satisfactory.

## CONCLUSIONS

This paper aimed at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA was implemented on an Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and an NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA methods were tested on the Griewank benchmark function. Comparison results indicate that the parallel SCE-UA significantly improved computational efficiency compared to the original serial version. The OpenCL implementation obtained the best overall acceleration results, albeit with the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.

## ACKNOWLEDGEMENTS

This research was funded by the China Postdoctoral Science Foundation on Grant (Grant No. 2016M600096); Research on Key Technologies of Real Time Dynamic Control of Water Level in Flood Season of Cascade Reservoirs (KY1734); Third Sub-Project: Flood Forecasting, Controlling and Flood Prevention Aided Software Development – Flood Control Early Warning Communication System and Flood Forecasting; Controlling and Flood Prevention Aided Software Development for Poyang Lake Area of Jiangxi Province (0628-136006104242, JZ0205A432013, SLXMB200902); Construction project of Shaanxi province medium and small river hydrological monitoring and forecast system – construction of Guanzhong and north of Shaanxi flood forecast scheme (JZ0205A112015); National Key Research and Development Plan (2016YFC0803107); Major International (Regional) Joint Research Project – China's Water and Food Security under Extreme Climate Change Impact: Risk Assessment and Resilience (G0305, 7141101024), International Project (71461010701); Study of distributed flood risk forecast model and technology based on multi-source data integration and hydro meteorological coupling system (2013CB036406); and China National Flash Flood Disaster Prevention and Control Project (126301001000150068). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. The authors wish to acknowledge Dr Mark Amo-Boateng for his help in developing the CUDA-based SCE-UA method. Guangyuan Kan and Ke Liang contributed equally to this work.