In this article we present two novel multipurpose reservoir optimization algorithms, named nested stochastic dynamic programming (nSDP) and nested reinforcement learning (nRL). Both algorithms are built as a combination of two algorithms: in the nSDP case these are (1) stochastic dynamic programming (SDP) and (2) a nested optimal allocation algorithm (nOAA), and in the nRL case they are (1) reinforcement learning (RL) and (2) the nOAA. The nOAA is implemented with linear and non-linear optimization. The main novel idea is to include the nOAA at each SDP and RL state transition, which decreases the dimension of the original problem and alleviates the curse of dimensionality. Both nSDP and nRL can solve multi-objective optimization problems without significant computational expense or algorithmic complexity and can handle dense and irregular variable discretization. The two algorithms were coded in Java as a prototype application and tested on the Knezevo reservoir, located in the Republic of Macedonia. The nSDP and nRL optimal reservoir policies were compared with nested dynamic programming policies, and the overall conclusion is that nRL is more powerful, but significantly more complex, than nSDP.
INTRODUCTION
Optimal reservoir operation (ORO) is a multi-objective problem that is often solved by dynamic programming (DP) and stochastic dynamic programming (SDP). These two methods suffer from the so-called ‘dual curse’, which prevents them from being employed in reasonably complex water systems. The first is the ‘curse of dimensionality’, characterized by an exponential growth of the computational complexity with the state–decision space dimension (Bellman 1957). The second is the ‘curse of modelling’, which requires an explicit model of each component of the water system (Bertsekas & Tsitsiklis 1996) in order to calculate the effect of each system transition. The curse of dimensionality limits the number of state–action variables and prevents DP and SDP from being used in complex reservoir optimization problems.
There have been various attempts to overcome the curses (Castelletti et al. 2012; Li et al. 2013; Anvari et al. 2014), as well as earlier DP-based methods such as DP by successive approximations (Bellman & Dreyfus 1962), incremental DP (Larson 1968) and differential DP (DDP) (Jacobson & Mayne 1970). DDP starts with an initial guess of the value function and policy for the goal and then iteratively improves the policy using different techniques (Atkeson & Stephens 2007). Incremental DP attempts to find a global solution to a DP problem by incrementally improving local constraint satisfaction properties as experience is gained through interaction with the environment (Bradtke 1994).
In the last decade, there has been significant RL research and application in ORO. Researchers from the Polytechnic University of Milan (Italy) have developed SDP and a number of RL implementations in ORO (Castelletti et al. 2001, 2007). The study by Castelletti et al. (2002) proposes a variant of Q-learning named Qlp (Q-learning planning) to overcome the limitations of SDP and standard Q-learning by integrating the off-line approach typical of SDP with the model-free character of Q-learning. The vast state–action space is in most cases extremely difficult to express with a lookup table, so a generalization through function approximation (for example, by a neural network) is often required (see, e.g., Bhattacharya et al. 2003). A similar approach proposed by Ernst et al. (2006), called ‘fitted Q-iteration’, combines the RL concepts of off-line learning and functional approximation of the value function. Recent RL methods (Castelletti et al. 2010) have been using tree-based regression to mitigate the curse of dimensionality. Applications of various ORO methods are reviewed in Yeh (1985) and, for multireservoir systems, in Labadie (2004).
This paper addresses the multi-objective ORO problem of satisfying multiple objectives related to: (1) reservoir releases to satisfy multiple downstream users competing for water with dynamically varying demands; (2) deviations from target water levels in the reservoir (recreation and/or flood control); and (3) hydropower production, which is a function of the reservoir water level and the reservoir releases. As posed, this problem has multiple objectives (eight in our case study) and multiple decision variables (six in our case study), and it is unsolvable with standard ORO algorithms because of the curse of dimensionality. The main objective is to research and develop algorithms that can solve the previously mentioned multi-objective ORO problems while alleviating the curse of dimensionality.
We have developed two new algorithms, named nested SDP (nSDP) and nested RL (nRL). These algorithms are similar to the previously published nested dynamic programming (nDP) algorithm (Delipetrev et al. 2015), which was compared with existing DP methods. At each state transition of the nSDP and nRL, an additional nOAA is executed to allocate optimal releases to individual water users, which lowers the dimension of the original problem and successfully alleviates the curse of dimensionality. The nOAA is implemented with (1) the simplex method for linear allocation and (2) weighted quadratic deficits for non-linear allocation.
The nSDP and nRL algorithms were tested on the Knezevo multipurpose reservoir of the Zletovica hydro-system, located in the Republic of Macedonia. The Zletovica hydro-system is a relatively complex water resource system, including the Knezevo reservoir, significant tributary inflow downstream of the reservoir, several intake points, several water supply and irrigation users, and hydropower. The specific problem addressed here is how to operate the Knezevo reservoir so as to satisfy the water users and other objectives as much as possible. The main challenge is to include five water users (two towns, two agricultural users and the ecological demand), the minimum and maximum reservoir critical levels, and hydropower, creating an optimization problem with a total of eight objectives and six decision variables.
This article is organized in six sections. The next section describes the ORO problem and the nSDP and nRL algorithms, followed by a section explaining the Zletovica case study and the optimization problem formulation. As the Zletovica case study is not a classical single-reservoir optimization problem, the subsequent sections describe the nSDP and nRL implementation and the experimental settings. Results and discussion follow, and finally the conclusions.
THE ORO PROBLEM AND NOVEL ALGORITHMS
nSDP
The nSDP pseudo code is shown in Algorithm 1:
(1) Discretize the reservoir inflow qt into L intervals, i.e., qlt (l = 1, 2, …, L).
(2) Create the transition matrices TM that describe the probabilities of transition between the inflow classes from one time step to the next.
(3) Discretize the storages st and st+1 into m intervals, i.e., sit (i = 1, 2, …, m) and sjt+1 (j = 1, 2, …, m) (in this case xt = st), and set k = 0.
(4) Set time t=T− 1 and k=k+ 1.
(5) Set reservoir level i= 1 (for time step t).
(6) Set reservoir level j = 1 (for time step t + 1).
(7) Set inflow cluster l= 1 (for time step t).
(8) Calculate the total release rt using Equation (4).
(9) Execute the nested optimization algorithm to allocate the total release to all users {r1t, r2t,…rnt} in order to meet their individual demands.
(10) Calculate g (xt, xt+1, at) and update V(xt).
(11) l=l+ 1.
(12) If l≤L, go to step 8.
(13) j = j + 1.
(14) If j≤m, go to step 7.
(15) Select the optimal actions (decision variables) {a1t, a2t…ant}opt, which consist of the optimal transition {xt+1}opt and the users releases {r1t, r2t,…rnt}opt that give minimal value of V(xt).
(16) i=i + 1.
(17) If i≤m, go to step 6.
(18) If t > 0
(19) t = t − 1
(20) Go to step 5.
(21) If t = 0, check whether the optimal actions (decision variables) {a1t, a2t…ant}opt have changed from the previous episode (or over the last three consecutive episodes). If they have changed, go to step 4; otherwise stop.
The underlined step 9 in Algorithm 1 is the nOAA. Algorithm 1 presents the general nSDP algorithm, which needs to be adjusted depending on the case study, as will be demonstrated in the following sections.
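To make the loop structure of Algorithm 1 more concrete, the following is a minimal, simplified Java sketch of the backward recursion with the nOAA nested at each transition. It treats storage as the only state variable and averages over the inflow classes with their probabilities; the helper methods (release, nestedAllocation, stepCost) are placeholders standing in for Equation (4), the nOAA and the cost function g, and do not reproduce the actual implementation used in the study.

```java
import java.util.Arrays;

/** Minimal structural sketch of the nSDP backward recursion (Algorithm 1).
 *  Storage is treated as the only state variable and the expectation is taken
 *  over the inflow classes; the helper methods are placeholders for
 *  Equation (4), the nOAA and the cost function g. */
public class NestedSdpSketch {
    static final int T = 52, M = 73, L = 5;   // time steps, storage levels, inflow classes
    static double[] s = new double[M];        // discretized storage volumes s_t
    static double[][] q = new double[T][L];   // inflow class centres q_t^l
    static double[][] p = new double[T][L];   // inflow class probabilities

    public static void main(String[] args) {
        double[][] V = new double[T + 1][M];  // cost-to-go; terminal values V[T][*] = 0
        for (int t = T - 1; t >= 0; t--) {    // step 4: backward over time
            for (int i = 0; i < M; i++) {     // step 5: current storage level s_t
                double best = Double.MAX_VALUE;
                for (int j = 0; j < M; j++) { // step 6: candidate next storage s_{t+1}
                    double expected = 0.0;
                    for (int l = 0; l < L; l++) {                 // step 7: inflow class
                        double rt = release(s[i], s[j], q[t][l]); // step 8: Equation (4)
                        double[] ri = nestedAllocation(rt, t);    // step 9: nOAA
                        double g = stepCost(s[i], s[j], ri, t);   // step 10: cost g
                        expected += p[t][l] * (g + V[t + 1][j]);
                    }
                    best = Math.min(best, expected);              // step 15: minimal V(x_t)
                }
                V[t][i] = best;
            }
        }
        System.out.println(Arrays.toString(V[0]));
    }

    // Placeholder for the mass balance Equation (4); evaporation and spill are omitted here.
    static double release(double st, double st1, double qt) { return Math.max(0.0, st - st1 + qt); }

    // Placeholder for the nested optimal allocation algorithm (simplex or quadratic deficits).
    static double[] nestedAllocation(double rt, int t) { return new double[] { rt }; }

    // Placeholder for the weighted cost g(x_t, x_{t+1}, a_t).
    static double stepCost(double st, double st1, double[] ri, int t) { return 0.0; }
}
```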
nRL
Reinforcement learning (RL) is a machine learning method that maps situations to actions in order to maximize a cumulative reward signal. The RL components are an agent, an environment, and a reward function. The environment is observable to the agent through the state xt (state variables). The agent observes the state xt and takes an action at. The environment reacts to this action and, based on the changes in the environment, gives a reward g (xt, xt+1, at) to the agent. The main difference between RL and SDP is that while SDP performs an exhaustive optimization search over the entire state–action space, the RL optimization is incremental and considers only the currently visited state; in other words, SDP applies breadth-first search, while RL applies single-step depth-first search (Lee & Labadie 2007).
The nRL design can support several state xt and action at variables. One possible nRL design is to define the state xt = {t, st, qt}, the action at = {st+1} and the reward g (xt, xt+1, at). Often, the RL action is defined as the reservoir release at = {rt}, but since these variables are related through the mass balance Equation (4), this makes no conceptual difference: when the next reservoir volume st+1 is known, the release rt can be calculated, and vice versa. There are, however, implementation differences between using the next reservoir volume st+1 and the reservoir release rt as the action, which are discussed in the section ‘nRL application to the case study’. The nRL pseudo code is shown in Algorithm 2:
(1) Divide the inflow into N episodes for each year.
(2) Discretize the reservoir inflow qt into L intervals, making L interval centers qlt (l = 1, 2, …, L).
(3) Discretize storage st in m intervals, making m discretization levels sit (i= 1, 2, …, m).
(4) Set initial variables: α, γ, maximum number of episodes – M, learning threshold – LT.
(5) Set T as the period that defines the number of time steps t in an episode (in our case 52 for weekly and 12 for monthly data).
(6) Set LR=0.
(7) Set n = 1 (number of an episode).
(8) Set t = 1 (time step of a period).
(9) Define the initial state xt by selecting a starting reservoir volume sit.
(10) Get the reservoir inflow qlt and t from the current episode.
(11) Select action at (exploration, or exploitation) and make transition xt+1.
(12) Calculate the reservoir release rt based on xt, xt+1, qlt and Equation (4).
(13) Execute the nested optimization by distributing the reservoir release rt among the water demand users using the linear or quadratic formulation, calculate the deficits and other objectives, and calculate the reward g (xt, xt+1, at).
(14) Calculate the state–action value Q (xt, at).
(15) Calculate the learning update |Q (xt+1, at+1) − Q (xt, at)| and add it to LR.
(16) t = t + 1 and move the agent to state xt+1.
(17) If t<T then go to step 10.
(18) If t=T then n=n+ 1.
(19) If n < N then set new episode data and go to step 8.
(20) If n=N and LR>LT then go to step 6.
(21) If n=N and LR<LT then stop.
(22) If n=M then stop.
The main nRL feature is step 13, which executes the nOAA. The nRL design can support additional state variables, as will be demonstrated in the case study implementation presented in the following sections.
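As an illustration of steps 11–16, the following is a minimal Java sketch of ε-greedy action selection and a tabular Q-update of the kind used in nRL. The state is reduced here to a single integer index, the reward is assumed to be computed by the nested allocation step, and the returned |delta| is one possible contribution to the learning measure LR; the actual implementation details in the study may differ.

```java
import java.util.Random;

/** Minimal sketch of ε-greedy action selection and a tabular Q-learning update
 *  (Algorithm 2, steps 11–16). State and action are reduced to integer indices;
 *  the reward is assumed to come from the nested allocation step. */
public class NestedRlStepSketch {
    static final int STATES = 94_900, ACTIONS = 73;   // sizes quoted in the case study
    static final double[][] Q = new double[STATES][ACTIONS];
    static final Random rnd = new Random(42);

    /** Step 11: select an action by exploration (random) or exploitation (greedy). */
    static int selectAction(int state, double epsilon) {
        if (rnd.nextDouble() < epsilon) {
            return rnd.nextInt(ACTIONS);              // exploration
        }
        int best = 0;                                 // exploitation: argmax over Q(x_t, a)
        for (int a = 1; a < ACTIONS; a++) {
            if (Q[state][a] > Q[state][best]) best = a;
        }
        return best;
    }

    /** Steps 14–15: update Q(x_t, a_t) and return the magnitude of the change,
     *  which can be accumulated into the learning measure LR. */
    static double update(int state, int action, int nextState, double reward,
                         double alpha, double gamma) {
        double maxNext = Q[nextState][0];
        for (int a = 1; a < ACTIONS; a++) maxNext = Math.max(maxNext, Q[nextState][a]);
        double delta = reward + gamma * maxNext - Q[state][action];
        Q[state][action] += alpha * delta;            // Q-learning update
        return Math.abs(delta);
    }
}
```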
Approaches to nested optimization of step-wise resource allocation
The nOAA is the same as in Delipetrev et al. (2015), where two methods are used to optimally allocate the total reservoir release rt among n water users: the simplex method in the case of the linear problem, and weighted quadratic deficits for the non-linear problem. Each water user is described by its demand dit and corresponding weight Wit at time step t. For the nested optimal allocation, the following variables are relevant: d1t, d2t…dnt are the users' demands; W1t, W2t…Wnt are the corresponding demand weights; rt is the reservoir release; r1t..rnt are the users' releases; v is the release discretization value.
If the release rt can satisfy the aggregated demand of all users, then the optimal allocation is not performed since all the releases can be set to their demands.
Linear method
Minimization of the optimization problem is performed over the release variables rit.
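The equations of the linear formulation are not reproduced here; a plausible form consistent with the description (weighted linear deficits solved by the simplex method) is:

$$\min_{r_{1t},\ldots,r_{nt}} \; \sum_{i=1}^{n} W_{it}\,\bigl(d_{it}-r_{it}\bigr) \quad \text{subject to} \quad \sum_{i=1}^{n} r_{it} \le r_t, \qquad 0 \le r_{it} \le d_{it}.$$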
Non-linear method
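Analogously, a plausible form of the non-linear (weighted quadratic deficits) formulation described in the text is:

$$\min_{r_{1t},\ldots,r_{nt}} \; \sum_{i=1}^{n} W_{it}\,\bigl(d_{it}-r_{it}\bigr)^{2} \quad \text{subject to} \quad \sum_{i=1}^{n} r_{it} \le r_t, \qquad 0 \le r_{it} \le d_{it}.$$

The quadratic form tends to spread deficits more evenly among users with similar weights, whereas the linear form tends to satisfy the higher-weighted users first.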
CASE STUDY
There are five towns in this region (Kratovo, Probishtip, Zletovo, Shtip and Sveti Nikole) and two large agriculture irrigation regions named upper and lower zone. The region is characterized by mountainous topography; the system is also designed to include several small hydropower plants, most of which are located downstream of the Knezevo reservoir as derivational power plants that utilize the natural head differences created by the topography. The hydropower system is still in development and according to the feasibility study report (GIM 2010) the plan is to build eight hydropower plants.
The GIM (2008) report contains a detailed hydrological model of the Zletovo river basin with four river flow measurement points on the River Zletovica. The monthly time series data of the river flow measurement points from the year 1951 to 1991 are used in this research. There is a significant tributary inflow between the Knezevo reservoir and the last branching point. First, the tributary inflow is used to satisfy the water demand objectives, and if additional water quantities are needed then they are released from the reservoir.
The rit represent the release quantities to the individual water users. The numbering (r3t to r7t) is selected to fit the optimization formulation, in which the objectives related to the reservoir water level are numbered with indexes 1 and 2, as will be shown below. The solid line represents the main Zletovica River and its significant left tributary. Some of the presented variables are further explained in the following subsection.
The hydro-system is modelled in a lumped way, such that water from the tributary inflow is first allocated to all users, and after that the reservoir releases are used to satisfy the remaining user demands. The tributary inflow is calculated as the difference between the last river measurement point q3t and the reservoir inflow qt. The analysis showed that this assumption holds and that it is possible to consider the tributary inflow as the total water quantity available to all users. This approach decreases the number of system variables characterizing the water users (these two variables are, however, used for some hydropower calculations). In this system, the main water users are the towns Shtip and Sveti Nikole and both agricultural zones.
The maximum water level in the Knezevo reservoir considered for this study is Hmax = 1,061.5 m amsl, which is, in fact, the normal operational level. This level corresponds to max storage volume of Vmax = 23.5 × 106 m3. The minimum storage volume (dead storage) in the Knezevo reservoir is Vmin = 1.50 × 106 m3 corresponding to Hmin = 1,015.0 m amsl water level, leaving effectively 22.0 × 106 m3 of storage volume in the Knezevo reservoir for balancing available inflows with downstream water demands.
Formulation of the optimization problem
The other HEC are calculated in the same way as HEC0, with the important note that HEC2 and HEC3 use the q1t and q2t variables. All HEC coefficients are taken from GIM (2010). All the hydropower plants together produce the total energy pt.
The action vector at consists of six actions or decision variables: st+1, r3t, r4t, r5t, r6t and r7t, which are the next optimal reservoir state and the water user releases at each time step. Using these decision variables, it is possible to calculate all other variables and OFs.
NSDP AND NRL ALGORITHMS IMPLEMENTATION
nSDP application to the case study
The nSDP was adjusted to accommodate the case study optimization problem formulation presented in the section ‘Case study’. The main implementation issue in applying nSDP is how to include the four stochastic variables qt, q1t, q2t and the tributary inflow qTrt, as shown in Figure 2. To our knowledge, there is no example of a numerical SDP solution with four stochastic variables that does not provoke the curse of dimensionality. Perhaps it is possible to design one mathematically, but the practical implementation would probably be very difficult and impractical.
The alternative approach is to investigate the correlation between the reservoir inflow qt and the tributary inflow qTrt. The correlation coefficient between these two variables is about 0.9 on weekly data. This high correlation gives the opportunity to include the tributary inflow as a second stochastic variable in the nSDP algorithm. If this were not the case (a low correlation coefficient), then another approach would be needed. The nSDP with the two stochastic variables can be implemented only if the reservoir inflow qt and the tributary inflow qTrt belong to the same cluster at each time step. It is worth noting that the high correlation coefficient suggests that the values of both variables belong to the same cluster interval at each time step over the entire modelling period.
The correlation analysis between the reservoir inflow qt and the tributary inflow qTrt brings us to a possible solution: to discard the other stochastic variables q1t and q2t and simplify the optimization problem formulation. The stochastic variables q1t and q2t are only used in the calculation of the hydropower OF and do not affect the other OFs. The consequence of this simplification and adjustment is that the HEC2 and HEC3 power production (and hence the total hydropower production) cannot be calculated using nSDP. Therefore, the hydropower aspect is not included in nSDP.
Algorithm 3 adds and changes several steps of Algorithm 1 to implement the Zletovica case study and is shown below:
(1a) Discretize the tributary inflow qTrt into L intervals, i.e., qTr,lt (l = 1, 2, …, L).
(2a) Create the transition matrices TM that describe the transition probabilities of the tributary inflow qTrt.
(7) Set reservoir inflow and tributary inflow cluster l = 1 (for time step t) (the reservoir and tributary inflow clusters are the same).
(7a) Distribute the tributary inflow using nested optimization between water demand users and calculate their remaining deficits.
(9) Execute the nested optimization algorithm to allocate the total release to all users {r3t, r4t, r5t, r6t, r7t} in order to meet their remaining deficits and calculate D1, D2, D3, D4, D5, D6, D7 and D8.
Algorithm 3 has the additional steps (1a) and (2a), which are added after Algorithm 1 steps (1) and (2), respectively; these steps handle the tributary inflow variable. Both variables, the reservoir inflow qt and the tributary inflow qTrt, are discretized using the K-means algorithm. Algorithm 3 step (7) replaces Algorithm 1 step (7), where the same cluster is set for the reservoir and tributary inflows. Step (7a) is added after step (7), and step (9) replaces the corresponding Algorithm 1 step, as described previously.
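The K-means discretization of the inflows can be illustrated with the following simplified one-dimensional Java sketch (Lloyd's algorithm with evenly spread initial centres); the actual clustering implementation and initialization used in the study may differ.

```java
import java.util.Arrays;

/** Simplified one-dimensional K-means (Lloyd's algorithm) of the kind that can
 *  discretize the reservoir and tributary inflows into L classes. */
public class InflowKMeansSketch {
    static double[] cluster(double[] values, int k, int iterations) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] centres = new double[k];
        for (int c = 0; c < k; c++) {                          // spread initial centres over sorted data
            centres[c] = sorted[(int) ((c + 0.5) * sorted.length / k)];
        }
        int[] label = new int[values.length];
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < values.length; i++) {          // assignment step: nearest centre
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(values[i] - centres[c]) < Math.abs(values[i] - centres[best])) best = c;
                }
                label[i] = best;
            }
            for (int c = 0; c < k; c++) {                      // update step: mean of assigned values
                double sum = 0.0;
                int n = 0;
                for (int i = 0; i < values.length; i++) {
                    if (label[i] == c) { sum += values[i]; n++; }
                }
                if (n > 0) centres[c] = sum / n;
            }
        }
        Arrays.sort(centres);                                  // class centres, ascending order
        return centres;
    }
}
```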
nRL application to the case study
The nRL includes all the case study variables (qt, qTrt, q1t and q2t) and implements the optimization problem formulation as described in the case study section. The nRL executes multiple episodes with deterministic time series data, where each episode is one year. The nRL implementation is very specific to the ORO problem described in the case study. This is why designing and implementing RL (and other machine learning techniques) is often an art: the modellers construct the entire system and define the variables, states, actions, rewards, etc.
The primary design decision in nRL (and RL in general) is to determine the state, the action and the reward variables. Three different approaches to define the state xt were tested: (1) xt = {t, st}, (2) xt = {t, st, qt}, (3) xt = {t, st, qt, qTrt}. The nRL action and reward were the same in all three approaches. The action at is described by the next storage volume, at = {st+1}, and consequently the ‘nested’ releases at = {st+1, rt, r3t, r4t, r5t, r6t, r7t}. The reward g (xt, xt+1, at) is defined by the optimization problem formulation, i.e., Equation (15). The only difference is that the deviation is taken with a negative sign and the nRL OF is to maximize this negative deviation; the maximum reward is 0, obtained when the objectives are satisfied.
As mentioned before, the action could instead be described as the reservoir release at = {rt}. This makes no conceptual difference as far as the equations are concerned, since the next state st+1 can be calculated from the mass balance equation, but it is more complicated to implement. In our case, the reservoir storage st and reservoir inflow qt are discretized, and the evaporation et is calculated using st and st+1. If a reservoir release action rt is selected, then, based on the mass balance equation, the calculated next reservoir volume st+1 will generally fall between two discretized storage volumes.
Instead, it is much more convenient and easier to implement the next reservoir volume as action at= {st+1}. In that case, the start and next discrete storage volumes st and st+1, and discretized reservoir inflow qt are defined, and the evaporation et and the reservoir release rt can be easily calculated.
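For reference, with the next storage volume chosen as the action, the release follows directly from the reservoir mass balance; a simplified form of Equation (4) is assumed here, omitting spill and other minor terms:

$$s_{t+1} = s_t + q_t - r_t - e_t \quad\Longrightarrow\quad r_t = s_t - s_{t+1} + q_t - e_t.$$

Because st+1 is chosen directly from the discretized storage levels, no interpolation between grid points is needed.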
The state space grows exponentially with additional state variables. The state space directly influences the computational time and the agent's ability to learn. However, the action space stays the same owing to the ‘nested’ methodology. The third approach was used, which increased the state space dimension to about 94,900 cells (52 weeks × 73 reservoir levels × 5 reservoir inflow classes × 5 tributary inflow classes). Because the agent explores/exploits the possible actions over the modelling period, it is very likely that some of the Q (xt, at) entries in the matrix will remain unused. The solution selected for dealing with this issue was to use the HashMap data structure available in Java. Both nSDP and nRL are developed in Java. The nRL implementation pseudo code for the case study is shown in Algorithm 4 below:
(2a) Discretize the tributary inflow qTrt into L intervals, i.e., qTr,lt (l = 1, 2, …, L).
(9) Define the initial state xt with an initial reservoir volume st, and read the reservoir inflow qlt, the tributary inflow cluster value qTr,lt, q1t and q2t from the current episode, together with the time step t.
(12a) Distribute the tributary inflow using nested optimization between water demand users and calculate their remaining deficits.
(13) Execute the nested optimization by distributing the reservoir release rt among the water demand users {r3t, r4t, r5t, r6t, r7t}, satisfying their remaining deficits, calculate D1, D2, D3, D4, D5, D6, D7 and D8, and calculate the reward g (xt, xt+1, at).
Algorithm 4 steps (2a) and (12a) are added after Algorithm 2 steps (2) and (12), respectively. Steps (9) and (13) of Algorithm 4 replace the corresponding steps of Algorithm 2.
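The HashMap-based storage of the Q-values mentioned above can be sketched as follows; the composite key layout is illustrative only, and the actual encoding used in the study may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a sparse Q-table backed by a HashMap, so that state–action pairs
 *  never visited by the agent consume no memory. The composite key layout
 *  below is illustrative. */
public class SparseQTableSketch {
    private final Map<Long, Double> q = new HashMap<>();

    // Encode (week, storage level, inflow class, tributary class, action) into one key.
    private long key(int t, int i, int l, int lTr, int a) {
        return ((((long) t * 73 + i) * 5 + l) * 5 + lTr) * 73 + a;
    }

    public double get(int t, int i, int l, int lTr, int a) {
        return q.getOrDefault(key(t, i, l, lTr, a), 0.0);   // unvisited entries default to 0
    }

    public void set(int t, int i, int l, int lTr, int a, double value) {
        q.put(key(t, i, l, lTr, a), value);
    }
}
```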
EXPERIMENTAL SETTINGS
The available 55 years of weekly data are separated into two parts: (1) training and (2) testing. The data from 1951 to 1994 (2,340 time steps) are used for training and those from 1994 to 2004 (520 time steps) for testing. The nSDP training data consist of the reservoir inflow qt and the tributary inflow qTrt in the previously mentioned period. The nRL training data consist of the reservoir inflow qt and the tributary inflow qTrt, while the two other flows q1t and q2t are used for the hydropower calculation. The nSDP and nRL data for the minimum and maximum levels, water supply, irrigation demands, ecological flow and hydropower are set to the 2005 weekly data presented in the case study section, and they are the same in the training and testing periods. The reservoir operation volume is discretized into 73 equal levels of 300 × 103 m3 each. The minimum level was set at 1,021.5 m amsl and the maximum level at 1,060 m amsl. The weights applied in these experiments are shown in Table 1. At the beginning, the nested optimization algorithm (linear or quadratic) and the number of clusters (in our case five) are selected in both nSDP and nRL. Both nSDP and nRL have the same experimental settings.
| Experiments | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 |
|---|---|---|---|---|---|---|---|---|
| nDP-L5, nDP-Q5; nSDP-L5, nSDP-Q5; nRL-L5, nRL-Q5 | 2,000,000 | 2,000,000 | 200 | 1 | 200 | 1 | 300 | 0.01 |
The main OF combines three distinct objective types: the minimum and maximum reservoir critical levels, which are measured in m; the water user demands, which are measured in 103 m3 per time step (week or month); and the hydropower energy production, which is measured in MWh per time step (week or month). This is the main reason why the weights have different magnitudes. A similar approach is taken in other previous research studies (e.g., Pianosi & Soncini-Sessa 2009; Rieker 2010; Quach 2011).
The weights are set according to the objectives' importance and shape the ORO policy. The most important objective is the environmental flow (w7), followed by the cities' water demands (w3 and w5), the agriculture demands (w4 and w6) and, lastly, hydropower production (w8). The hydropower weight was set extremely low for two main reasons: (1) according to the reports, the hydropower objective is considered a by-product of reservoir operation and not its main feature; and (2) to lower as much as possible the influence of hydropower on the ORO and thus have a valid basis for comparing nSDP and nRL.
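For clarity, the deviations D1–D8 are combined with the weights in Table 1 into a single objective; a weighted-sum aggregation of the following general form is assumed here (the exact expression is given by the optimization problem formulation, Equation (15)):

$$J \;=\; \sum_{t} \sum_{k=1}^{8} w_k \, D_{k,t},$$

where Dk,t denotes the deviation of objective k at time step t.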
The nDP results on the testing data represent ‘the optimal operation’, since nDP is a deterministic optimization algorithm that calculates the ORO directly. The nSDP and nRL, on the other hand, are trained on the training data, producing a policy, and have not seen the testing data. The nDP results are used as a benchmark for the nSDP and nRL policies: the closer the policies derived by nSDP and nRL are to nDP, the better they are.
The algorithms are additionally labelled to denote the deficit formulations used in the nOAA. For example, nDP-L5 stands for nDP using the linear deficit formulation, and nDP-Q5 stands for nDP using the quadratic deficit formulation. The nRL parameters are initially set to α0 = 0.8, γ = 0.5 and ε0 = 0.8. The parameter α is set to decrease linearly with the number of episodes. The maximum number of episodes is set to M = 400,000. Various approaches for decreasing ε were tested, and the one used in the experiments halves ε every 100,000 episodes. Starting at ε0 = 0.8, as the number of episodes increases the agent performs fewer exploration and more exploitation actions, ensuring convergence towards the optimal solution.
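A minimal Java sketch of these parameter schedules is given below; it assumes α decays linearly towards zero at M episodes (the exact lower bound used in the study is not stated):

```java
/** Sketch of the learning-rate and exploration schedules described in the text:
 *  α decreases linearly with the episode number and ε is halved every
 *  100,000 episodes, starting from α0 = 0.8 and ε0 = 0.8. */
public class RlScheduleSketch {
    static final double ALPHA0 = 0.8, EPSILON0 = 0.8;
    static final int MAX_EPISODES = 400_000;

    static double alpha(int episode) {
        return ALPHA0 * (1.0 - (double) episode / MAX_EPISODES);   // linear decay towards 0
    }

    static double epsilon(int episode) {
        return EPSILON0 / Math.pow(2, episode / 100_000);          // halved every 100,000 episodes
    }
}
```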
RESULTS AND DISCUSSION
The absolute difference between the nRL-L5 and nDP-L5 optimal reservoir volumes can be used as a stopping criterion. Evidently, the nRL-L5 optimal reservoir policy performs best between 80,000 and 160,000 episodes of training. Afterwards, the policy deteriorates somewhat, although it remains relatively good.
The results in Figure 5 show that nDP-L5 and nDP-Q5, which are the target, have very low D1 and D2 deviations, and that nRL-L5 and nRL-Q5 are better than nSDP-L5 and nSDP-Q5. The same applies to the D3–D7 deviations, while D8 is not calculated for nSDP-L5 and nSDP-Q5. The nRL-L5 and nRL-Q5 produce better ORO policies than nSDP-L5 and nSDP-Q5, as shown in Figure 5.
CONCLUSIONS
The paper presented nSDP and nRL novel ORO algorithms that can solve problems with multiple decision variables, successfully alleviating the curse of dimensionality. These algorithms were implemented and tested in the case of the Zletovica hydro-system with eight objectives and six decision variables.
The nSDP has issues in implementing several state variables without provoking the curse of dimensionality, and thus adjustments were needed to fit nSDP to the case study optimization problem requirements. The nRL showed its true power by including all four stochastic variables and implementing the complete optimization problem formulation, but its implementation and tuning require additional effort. The main conclusion from the implementation of the algorithms is that nDP can handle complex optimization problem formulations without significant problems. The nSDP has limitations when additional optimization problem variables are included. The nRL is very powerful in implementing complex optimization problems, but needs tuning concerning its design, parameters, action list, convergence criteria, etc.
The presented nSDP and nRL and their implementation in the case study of the Zletovica river basin confirmed that in some situations the curse of dimensionality and the associated computational complexity can be overcome. There could be situations, for example when applying nSDP and the correlation between the reservoir and tributary inflows is low, in which the proposed solution is not applicable. This restricts the applicability of nSDP to a subset of reservoir problems. In this particular case, approximating the two stochastic variables in nSDP was the best approach. In any case, the nSDP algorithm is limited to a few stochastic variables before the curse of dimensionality makes it computationally intractable. On the other hand, nRL demonstrated its full capacity by including multiple stochastic variables and solving this problem and, at least on the conceptual level, can be applied to much more complex single and multireservoir problems.
The nSDP and nRL were used to derive a 1-year weekly optimal reservoir policy. The available weekly data (1951–2004) were divided into a training part (1951–1994) and a testing part (1994–2004). The nSDP and nRL optimized/learned the optimal reservoir policy on the training data, and their policies were examined on the testing data. The nDP solved the ORO problem for the testing period (1994–2004) and this solution was used as a target for both nSDP and nRL. It was interesting to observe how the nRL agent learns as the number of episodes increases: the nRL optimal reservoir policy is best between 80,000 and 160,000 learning episodes. The nSDP and nRL policies were benchmarked against the nDP results and it was found that nRL performs better than nSDP both overall and for all objectives separately. The main conclusion is that nRL is a better choice than nSDP, at least for the considered case study.
The presented nSDP and nRL algorithms were successfully tested on a relatively complex single-reservoir ORO problem, such as the Zletovica case study. Generally, nSDP cannot be applied to systems of two or more reservoirs because of the curse of dimensionality. On the other hand, nRL supports several stochastic variables, as demonstrated in this case study, and in our opinion could be a potential solution for multireservoir ORO problems. However, as stated previously, the nRL implementation is difficult and highly problem specific.
The developed nested algorithms are computationally efficient and can be run on standard personal computers. For the considered case study, on a standard PC, nDP executes in 1–3 min, nSDP in 2–5 min, and nRL in 8–15 min (the longest being nRL-Q).
The ORO is by nature a multi-objective problem because different objectives (water demands, hydropower and reservoir levels) are often involved. In this research, it was first reduced to a single-objective optimization problem by employing a single-objective aggregated weighted-sum function. It is possible to execute the single-objective optimization algorithms multiple times with several weight sets, i.e., to perform multi-objective optimization as a sequence of single-objective optimization searches. This method can be applied to nDP, nSDP and nRL, which would create fully fledged multi-objective algorithms. Future research will focus on designing and developing multi-objective variants of nDP, nSDP and nRL that produce a Pareto set.