In this article we present two novel multipurpose reservoir optimization algorithms named nested stochastic dynamic programming (nSDP) and nested reinforcement learning (nRL). Both algorithms are built as a combination of two algorithms; in the nSDP case it is (1) stochastic dynamic programming (SDP) and (2) nested optimal allocation algorithm (nOAA) and in the nRL case it is (1) reinforcement learning (RL) and (2) nOAA. The nOAA is implemented with linear and non-linear optimization. The main novel idea is to include a nOAA at each SDP and RL state transition, that decreases starting problem dimension and alleviates curse of dimensionality. Both nSDP and nRL can solve multi-objective optimization problems without significant computational expenses and algorithm complexity and can handle dense and irregular variable discretization. The two algorithms were coded in Java as a prototype application and on the Knezevo reservoir, located in the Republic of Macedonia. The nSDP and nRL optimal reservoir policies were compared with nested dynamic programming policies, and overall conclusion is that nRL is more powerful, but significantly more complex than nSDP.

## INTRODUCTION

Optimal reservoir operation (ORO) is a multi-objective problem that is often solved by dynamic programming (DP) and stochastic dynamic programming (SDP). These two methods suffer from the so-called ‘dual curse’ which forbids them to be employed in reasonably complex water systems. The first one is the ‘curse of dimensionality’ that is characterized with an exponential growth of the computational complexity with the state–decision space dimension (Bellman 1957). The second one is the ‘curse of modelling’ that requires an explicit model of each component of the water system (Bertsekas & Tsitsiklis 1996) to calculate the effect of each system's transition. The curse of dimensionality limits the number of state–action variables and prevents DP and SDP being used in complex reservoir optimization problems.

There have been various attempts to overcome the curses (Castelletti *et al.* 2012; Li *et al.* 2013; Anvari *et al.* 2014), or earlier DP-based on successive approximations (Bellman & Dreyfus 1962), incremental DP (Larson 1968) and differential DP (DDP) (Jacobson & Mayne 1970). The DDP starts with an initial guess of values and policies for the goal and continues with improving the policy using different techniques (Atkeson & Stephens 2007). The incremental DP attempts to find a global solution to a DP problem by incrementally improving local constraint satisfaction properties as experience is gained through interaction with the environment (Bradtke 1994).

In the last decade, there has been significant RL research and applications in ORO. Researchers from the Polytechnic University of Milan (Italy) have developed SDP and a number of RL implementations in ORO (Castelletti *et al.* 2001, 2007). The study by Castelletti *et al.* (2002) proposes a variant of Q-learning named Qlp (Q-learning planning) to overcome the limitations of SDP and standard Q-learning by integrating the off-line approach, typical for SDP and model-free characteristic of Q-learning. The vast state–action space in most cases is extremely difficult to express with a lookup table so often a generalization through a function approximation (for example by a neural network) is required (see e.g., Bhattacharya *et al.* 2003). A similar approach, proposed by Ernst *et al.* (2006), called ‘fitted *Q*-iteration’, combines RL concepts of off-line learning and functional approximation of the value function. Recent RL methods (Castelletti *et al.* 2010) have been using tree-based regression for mitigating the curse of dimensionality. The application of various ORO methods are reviewed in Yeh (1985) and for multireservoir systems in Labadie (2004).

This paper address the multi-objective ORO problem of satisfying multiple objectives related to: (1) reservoir releases to satisfy multiple downstream users competing for water with dynamically varying demands; (2) deviations from target water levels in the reservoir (recreation and/or flood control); and (3) hydropower production that is a combination of the reservoir water level and the reservoir releases. This problem when posed has multiple objectives (eight in our case study) and decision variables (six in our case study), and it is unsolvable with standard ORO algorithms because of the curse of dimensionality. The main objective is to research and develop algorithms that can solve previously mentioned multi-objective ORO problems, alleviating the curse of dimensionality.

We have developed two new algorithms named nested SDP (nSDP) and nested RL (nRL). These algorithms are similar to an already published nested dynamic programming (nDP) algorithm (Delipetrev *et al.* 2015) that is compared with already existing DP methods. At each state transition of the nSDP and nRL, an additional nOAA is executed to allocate optimal releases to individual water users, which lowers the starting problem dimension and successfully alleviates the curse of dimensionality. The nOAA is implemented with (1) simplex for linear allocation and (2) weighted quadratic deficits for non-linear allocation.

The nSDP and nRL algorithms were tested at the Knezevo multipurpose reservoir of the Zletovica hydro-system located in the Republic of Macedonia. The Zletovica hydro-system is a relatively complex water resource system, including one reservoir, Knezevo, significant tributary inflow downstream of the reservoir, several intake points, several water supply and irrigation users, and hydropower. The specific problem addressed here is how to operate the Knezevo reservoir, to satisfy as much as possible water users and other objectives. The main issue is to include five water users, two towns and two agricultural users, ecological demand, minimum and maximum reservoir critical levels, and hydropower, creating an optimization problem with a total of eight objectives and six decision variables.

This article is organized in six sections. The next section describes the ORO problem and the nSDP and nRL algorithms followed by a section explaining the Zletovica case study and optimization problem formulation. As the Zletovica case study is not a classical single reservoir optimization problem, in the next section we describe the nSDP and nRL implementation and then the experimental settings. Results and discussion follow and finally the conclusions.

## THE ORO PROBLEM AND NOVEL ALGORITHMS

### nSDP

*V*(

*x*) represents state value function,

_{t}*g (x*,

_{t}*x*

_{t}_{+1},

*a*) represents the reward function of transition between state

_{t}*x*and state

_{t}*x*

_{t}_{+1}¸

*a*is the decision–action vector including releases for multiple users,

_{t}*γ*is the discount factor to ensure convergence and is the probability for a reservoir inflow

*q*that is in interval

_{t}*i*in time step

*t*, to become that is in interval

*j*in time step

*t*+ 1, also shown in Equation (2). The minimization is performed on state and action variables:

*x*includes the reservoir storage

_{t}*s*, and the hydrometeorological information; which, in our case, is the reservoir inflow

_{t}*q*in the time [

_{t}*t, t*+ 1], making the state vector

*x*

_{t}*=*{

*s*,

_{t}*q*}. The reservoir is governed by the mass balance equation: where

_{t}*r*is the reservoir release and

_{t}*e*is the reservoir evaporation.

_{t}*a*defines the policy

_{t}*p*based on the state

*x*as shown in Equation (5): The objective functions (OFs) can be aggregated into single-objective aggregated weighted sum function as show in Equation (6): where

_{t}*g*(

_{t}*x*,

_{t}*x*

_{t}_{+1},

*a*) is the aggregated reward of

_{t}*n*objectives at time step

*t*,

*w*is the objective weight at time step

_{it}*t*and

*g*(

_{it}*x*,

_{t}*x*

_{t}_{+1},

*a*) is the step reward of each objective at time step

_{t}*t*.

*n*competing users, and this multiplies the total number of decision variables. This bring us to the main novelty and idea of the nested algorithms, and that is to execute nOAA at each state transition, optimizing releases for the water demand users, considering at the same time other objectives and constraints. That is the only difference between the nSDP and classical SDP, or the executing of the nOAA at each state transition, as shown in Figure 1. The

*d*…

_{1t}*d*are the water demands, while

_{nt}*r*…

_{1t}*r*are their respected releases at each time step, as shown in Figure 1. The reward

_{nt}*g*(

_{t}*x*,

_{t}*x*

_{t}_{+1},

*a*) is calculated at each nSDP transition, taking into account all objectives including starting and ending storage volumes

_{t}*s*and

_{t}*s*

_{t}_{+1}, release

*r*that is divided by nOAA to all users and functions

_{t}*r*…

_{1t}*r*depending on their demands

_{nt}*d*…

_{1t}*d*and priorities. Additional explanations about nesting and its features can be found in the nDP paper (Delipetrev

_{nt}*et al.*2015).

The nSDP pseudo code is shown in Algorithm 1:

(1) Discretize the reservoir inflow *q _{t}* into

*L*intervals i.e.,

*q*(

_{lt}*k*

*=*

*1, 2…, L*).

(2) Create the transition matrices *TM* that describe the transition probabilities .

(3) Discretize storage *s _{t}* and

*s*in

_{t+1}*m*intervals, i.e.,

*s*(

_{it}*i*

*=*

*1, 2, …, m*),

*s*(

_{jt+1}*j*

*=*

*1, 2, …, m*) (in this case

*x*

_{t}*=*

*s*and set

_{t})*k*

*=*

*0*.

(4) Set time *t**=**T**−* 1 and *k**=**k**+* 1.

(5) Set reservoir level *i**=* 1 (for time step *t*).

(6) Set reservoir level *j* = 1 (for time step *t* + 1).

(7) Set inflow cluster *l**=* 1 (for time step *t*).

(8) Calculate the total release *r _{t}* using Equation (4).

(9) Execute the nested optimization algorithm to allocate the total release to all users {*r*_{1t}*, r*_{2t}*,…r _{nt}*} in order to meet their individual demands.

(10) Calculate the *g* (*x _{t}, x_{t+}*

_{1}

*, a*) and update

_{t}*V(x*.

_{t})(11) *l**=**l**+* 1.

(12) If *l**≤**L*, go to step 8.

(13) *Jj**=**j**+* 1.

(14) If *j**≤**m*, go to step 7.

(15) Select the optimal actions (decision variables) {*a*_{1t}*, a*_{2t}*…a _{nt}*}

*, which consist of the optimal transition {*

_{opt}*x*

_{t+}_{1}}

_{op}_{t}and the users releases {

*r*

_{1t}

*, r*

_{2t}

*,*}

_{…}r_{nt}_{opt}that give minimal value of

*V*(

*x*).

_{t}(16) *i**=**i* + 1.

(17) If *i**≤**m*, go to step 6.

(18) If t > 0

(19) *t* = *t* − 1

(20) Go to step 4.

(21) If t = 0, check if the optimal actions (decision variables) {*a*_{1t}*, a*_{2t}*…a _{nt}*}

*, are changed from the previous episode (or in the last three consecutive episodes)? If they are changed, go to step 4, otherwise stop.*

_{opt}Underlined step 9 in Algorithm 1 is the nOAA. Algorithm 1 presents the general nSDP algorithm that depending on the case study needs to be adjusted, as will be demonstrated in the following sections.

### nRL

Reinforcement learning (RL) is a machine learning method that maps situation and actions to maximize the cumulative reward signal. The RL components are an agent, an environment, and a reward function. The environment is observable to the agent through state *x _{t}* (state variables). The agent observes state

*x*and takes action

_{t}*a*. The environment reacts to this action, and based on the changes in the environment, gives a reward

_{t}*g*(

*x*

_{t}, x_{t+}_{1}

*, a*) to the agent. The main difference between the RL and SDP is while SDP makes an exhaustive optimization search over all possible state–action space, the RL optimization is incremental for the currently visited state, or SDP applies breadth-first search, while RL applies single-step depth-first search (Lee & Labadie 2007).

_{t}*Q*(

*x*) is the state–action value function,

_{t}, a_{t}*α*is the learning rate coefficient,

*x*,

_{t}*a*,

_{t}*γ*and

*g*(

*x*

_{t}, x_{t+}_{1}

*, a*) have been described before. The maximization is performed on state and action variables.

_{t}The nRL design can support several state *x _{t}* and action

*a*variables. One of the possible nRL design is to define the state

_{t}*x*

_{t}*=*{

*t, s*}, action

_{t}, q_{t}*a*

_{t}*=*{

*s*} and reward

_{t+1}*g*(

*x*

_{t}, x_{t+}_{1}

*, a*). Often, the RL action is defined by the reservoir release

_{t}*a*

_{t}*=*{

*r*}, but since these variables are dependent in the mass balance Equation (4) it does not make any conceptual difference, e.g., when the next reservoir volume

_{t}*s*is known, the release

_{t+1}*r*can be calculated, or vice versa. There are RL action implementation differences between the next reservoir volume

_{t}*s*and the reservoir release

_{t+1}*r*, which are discussed in the section ‘nRL application to the case study’.

_{t}*N*years of available historical time series data of reservoir inflow, these data are divided appropriately in

*N*episodes. The

*RL*agent includes several parameter settings as previously described:

*α*– the learning rate;

*γ*– the discount factor;

*M*– the maximum number of episodes that defines the maximum number of episodes the agent will perform (this is the stopping criterion preventing the

*RL*infinite loop);

*LT*– learning threshold and

*LR*– learning rate.

*LR*is the sum of all the learning updates |

*Q*(

*x*

_{t+}_{1}

*, a*

_{t+}_{1})

*– Q*(

*x*)

_{t}, a_{t}*|*in one episode, as shown in Equation (8). If

*LR*is below some predefined threshold named

*LT*, then the

*RL*should stop learning: The nRL pseudo code is presented in Algorithm 2:

(1) Divide the inflow into *N* episodes for each year.

(2) Discretize the reservoir inflow *q _{t}* into

*L*intervals, making

*L*intervals centers

*q*(

_{lt}*k*

*=*1, 2…

*, L*).

(3) Discretize storage *s _{t}* in

*m*intervals, making

*m*discretization levels

*s*(

_{it}*i*

*=*1, 2

*, …, m*).

(4) Set initial variables: *α*, *γ*, maximum number of episodes – *M,* learning threshold – *LT*.

(5) Set *T* as period that defines the number of time steps *t* in episode (in our case 52 for weekly and 12 for monthly).

(6) Set *LR**=**0*.

(7) Set *n* = 1 (number of an episode).

(8) Set *t* = 1 (time step of a period).

(9) Define initial state *x _{t}* with selecting a starting reservoir volume

*s*.

_{it}(10) Get the reservoir inflow *q _{lt}* and

*t*from the current episode.

(11) Select action *a _{t}* (exploration, or exploitation) and make transition

*x*

_{t+}_{1}.

(12) Calculate the reservoir release *r _{t}* based on

*x*,

_{t}*x*

_{t+}_{1}

*, q*and Equation (4).

_{kt},(13) Execute the nested optimization with distributing the reservoir release *r _{t}* between water demand users using linear or quadratic formulation, calculate the deficits and other objectives, and calculate the reward

*g*(

*x*

_{t}, x_{t±}_{1}

*, a*).

_{t}(14) Calculate the state action value *Q* (*x _{t}, a_{t}*).

(15) Calculate learning update |*Q* (*x _{t}*

_{+1}

*, a*

_{t}_{+1})

*– Q*(

*x*)| and add it to

_{t}, a_{t}*LR*.

(16) *t**=**t**+* 1 and move agent to state *x _{t+}*

_{1}.

(17) If *t**<**T* then go to step 10.

(18) If *t**=**T* then *n**=**n**+* 1.

(19) If *n* < *N* then set new episode data and go to step 8.

(20) If *n**=**N* and *LR**>**LT* then go to step 6.

(21) If *n**=**N* and *LR**<**LT* then stop.

(22) If *n**=**M* then stop.

The main nRL feature is step 13 that executes the nOAA. The nRL design can support the additional state variables, as will be demonstrated in the case study implementation presented in the following sections.

### Approaches to nested optimization of step-wise resource allocation

The nOAA are the same as in Delipetrev *et al.* (2015) where two methods are used to optimally allocate the total reservoir release *r _{t}* between

*n*water users: simplex method in the case of linear problem and weighted quadratic deficit for non-linear problem. Each water user is described with its demand

*d*and corresponding weight

_{it}*W*at time step

_{it}*t*. For the nested optimal allocation, the following variables are relevant:

*d*…

_{1t}, d_{2t}*d*are users' demands;

_{nt}*W*are the corresponding demands' weights;

_{1t}, W_{2t}…W_{nt}*r*is the reservoir release;

_{t}*r*are the users' releases;

_{1t..}r_{nt}*v*is the release discretization value.

*r*can satisfy the aggregated demand of all users in Equation (9):

_{t}If the release *r _{t}* can satisfy the aggregated demand of all users, then the optimal allocation is not performed since all the releases can be set to their demands.

#### Linear method

Minimization of the optimization problem is performed on the release variable *r _{it}*.

#### Non-linear method

*r*is assumed to be discretized in

_{t}*v*levels; this value is set at the beginning and stays the same over nDP execution. The quadratic OF is to minimize: with the same constraints previously described in Equations (11)–(13). Again, the minimization is performed on the release variable

*r*.

_{it}## CASE STUDY

There are five towns in this region (Kratovo, Probishtip, Zletovo, Shtip and Sveti Nikole) and two large agriculture irrigation regions named upper and lower zone. The region is characterized by mountainous topography; the system is also designed to include several small hydropower plants, most of which are located downstream of the Knezevo reservoir as derivational power plants that utilize the natural head differences created by the topography. The hydropower system is still in development and according to the feasibility study report (GIM 2010) the plan is to build eight hydropower plants.

The GIM (2008) report contains a detailed hydrological model of the Zletovo river basin with four river flow measurement points on the River Zletovica. The monthly time series data of the river flow measurement points from the year 1951 to 1991 are used in this research. There is a significant tributary inflow between the Knezevo reservoir and the last branching point. First, the tributary inflow is used to satisfy the water demand objectives, and if additional water quantities are needed then they are released from the reservoir.

The *r _{it}* represent each water user release quantity. The numbering (

*r*to

_{3t}*r*) is selected to fit the optimization formulation in which objectives related to reservoir water level are numbered with indexes 1 and 2, as will be shown below. The solid line represents the main Zletovica river, and the significant left tributary. Some of the presented variables are further explained in the following subsection.

_{7t}The hydro-system is modelled in a lumped way such that water from the tributary inflow is first allocated to all users, and after that the reservoir releases are used to satisfy the remaining user demands. The tributary inflow is calculated as a difference between the last river measurement point *q _{3t}* and reservoir inflow

*q*. The analysis proved that this assumption holds and it is possible to consider tributary inflow as the total water quantity available to all users. This approach decreases the number of system variables characterizing the water users (these two variables are used however for some hydropower calculations). In this system, the main water users are the towns Shtip and Sveti Nikole and both agricultural zones.

_{t}The maximum water level in the Knezevo reservoir considered for this study is H_{max} = 1,061.5 m amsl, which is, in fact, the normal operational level. This level corresponds to max storage volume of V_{max} = 23.5 × 10^{6} m^{3}. The minimum storage volume (dead storage) in the Knezevo reservoir is V_{min} = 1.50 × 10^{6} m^{3} corresponding to H_{min} = 1,015.0 m amsl water level, leaving effectively 22.0 × 10^{6} m^{3} of storage volume in the Knezevo reservoir for balancing available inflows with downstream water demands.

### Formulation of the optimization problem

*w*is the objective weight for a given objective

_{it}*i*and time step

*t*and

*D*is the difference between the target value and decision variable for a given objective

_{it}*i*and time step

*t*.

*min l*and

_{t}*max l*are the recreation and flood water level targets. The purpose of the recreation level target is to preserve a minimum reservoir volume, especially as additional storage volume in case of emergency. The flood level target prevents the reservoir reaching the reservoir spill height and avoids possible uncontrolled spills.

_{t}*d*,

_{3t}, d_{4t}*d*,

_{5t}*d*,

_{6t}*d*. The objectives deficits are calculated using Equation (18): Here,

_{7t}*r*is the release (decision variable) for a given objective (

_{it}*i*) and time step (

*t*).

*w*as the hydropower energy production weight and

_{8t}*D*is calculated from: where

_{8t}*h*is the hydropower target demand and

_{t}*p*is the hydropower production.

_{t}_{0}–HEC

_{3}and HEC

_{6}are dependent on the reservoir operation and they are the only ones considered in the optimization. HEC

_{0}is positioned at the Knezevo reservoir and the entire reservoir release

*r*goes through the turbines of HEC

_{t}_{0}. The reservoir release

*r*is compared with the generator maximum water capacity HEC

_{t}_{0max}, as in Equation (20): The

*r*is the reservoir release water quantity that goes over the turbines of HEC

_{0t}_{0}. The energy generated by HEC

_{0}(MWh) is: where

*gen*

_{0}is the coefficient that includes all conversion coefficients and total efficiency and

*h*and

_{t}*h*

_{t+}_{1}are the reservoir levels in time step

*t*and

*t*+ 1.

Other HEC are calculated in the same way as HEC_{0}, with the very important notion that HEC_{2} and HEC_{3} are using *q _{1t}* and

*q*variables. All HEC coefficients are taken from GIM (2010). All the hydropower plants together produce the total energy

_{2t}*p*.

_{t}The action vector *a _{t}* consist of six actions or decision variables: s

*,*

_{t+1}, r_{3t}, r_{4t}*r*,

_{5t}*r*,

_{6t}*r*which are the next optimal reservoir state and water user releases at each time step. Using these decision variables, it is possible to calculate all other variables and OFs.

_{7t}## NSDP AND NRL ALGORITHMS IMPLEMENTATION

### nSDP application to the case study

The nSDP was adjusted to accommodate the case study optimization problem formulation presented in the section ‘Case study’. The main implementation issue in applying nSDP is how to include the four stochastic variables *q*, *q*_{1t}*, q*_{2t} and , as shown in Figure 2. There is no example of numerical solution of SDP with four stochastic variables without provoking the curse of dimensionality. Perhaps it is possible to design it mathematically, but the practical implementation will probably be very difficult and impractical.

The alternative approach is to investigate the correlation between the reservoir *q _{t}* and the tributary inflow. The correlation coefficient between these two variables is about 0.9 on weekly data. The high correlation gives the opportunity to include the tributary inflow as another stochastic variable in the nSDP algorithm. If this were not the case (low correlation coefficient), then another approach would be needed. The nSDP with the two stochastic variables can be implemented only if the reservoir

*q*and tributary inflow belong to the same cluster at each time step. It is worth noting that the high correlation coefficient typically suggests that the values of both variables belong to the same cluster interval at each time step over the entire modelling period.

_{t}The correlation analysis between reservoir inflow *q _{t}* and tributary inflow bring us to a possible solution to discard other stochastic variables

*q*

_{1t}and

*q*

_{2t}and simplify the optimization problem formulation. The stochastic variables

*q*

_{1t}and

*q*

_{2t}are only used in calculation of the hydropower OF, and do not affect other OF. The consequence of optimization problem simplification and adjustment is the impossibility to calculate HEC

_{2}and HEC

_{3}power production (and the total hydropower production as well) using nSDP. Therefore, the hydropower aspect is not included in nSDP.

Algorithm 3 adds and changes several steps of Algorithm 1 to implement the Zletovica case study and is shown below:

(1a) Discretize the tributary inflow into L intervals i.e., .

(2a) Create the transition matrices *TM* that describe the transition probabilities of tributary inflow *q ^{Tr}*.

(7) Set reservoir inflow and tributary inflow cluster *l* = 1 (for time step *t*) (the reservoir and tributary inflow clusters are the same).

(7a) Distribute the tributary inflow using nested optimization between water demand users and calculate their remaining deficits.

(9) Execute the nested optimization algorithm to allocate the total release to all users {*r*_{3t}*, r*_{4t}*, r*_{5t}*, r*_{6t}*, r*_{7t}} in order to meet their remaining deficits and calculate D1, D2, D3, D4, D5, D6, D7 and D8.

Algorithm 3 has additional steps (1a) and (2a) that are added after Algorithm 1 steps (1) and (2) correspondingly, and these steps calculate the tributary inflow variable. Both variables, the reservoir inflow *q _{t}* and tributary inflow , are discretized using K-means algorithm. Algorithm 3 step 7 replaces Algorithm 1 step 7, where the same cluster is set for the reservoir and tributary inflow. Steps (8a), (9) and (9a) are adding/replace steps as described previously.

### nRL application to the case study

The nRL includes all case study variables (*, q*_{1t} and *q*_{2t}), and implements the optimization problem formulation as described in the case study section. The nRL executes multiple episodes with deterministic variables time series data, where each episode is 1 year. The nRL implementation is very specific for this ORO problem described in the case study. That is why often designing and implementing RL (and other machine learning techniques) is an art, because the modellers construct the entire system, define variables, states, actions, rewards, etc.

The primary design decision in the nRL (and RL in general) is to determine the state, the action and the reward variables. Three different approaches to define states *x _{t}* were tested: (1)

*x*

_{t}*=*{

*t, s*}, (2)

_{t}*x*

_{t}*=*{

*t, s*}, (3) . The nRL action and reward were the same in all three approaches. Тhe action

_{t}, q_{t}*a*is described with the next storage volume

_{t}*a*

_{t}*=*{

*s*

_{t+}_{1}} and consequently ‘nested’ releases

*a*

_{t}*=*{

*s*

_{t+}_{1}

*, r*

_{t}, r_{3t}

*, r*

_{4t}

*, r*

_{5t}

*, r*

_{6t}

*, r*

_{7t}}. The reward

*g*(

*x*

_{t}, a_{t}, x_{t+}_{1}) is defined as an optimization problem formulation or Equation (15). The only difference is that deviation is with a negative sign and the nRL OF is to maximize negative deviation. The maximal gain is 0 when the objective is satisfied.

As mentioned before, the action can be described as the reservoir release *a _{t}*

*=*{r

*}. This does not make any conceptual difference considering equations, since the next state*

_{t}*s*

_{t+}_{1}can be calculated from the mass balance equation, but it is more complicated to implement. In our case, the reservoir storage

*s*and reservoir inflow

_{t}*q*are discretized, and the evaporation

_{t}*e*is calculated using

_{t}*s*and

_{t}*s*. If a reservoir release action

_{t+1}*r*is selected, then based on the mass balance equation the calculated next reservoir volume

_{t}*s*

_{t+}_{1}will fall in between two discretized storage volumes.

Instead, it is much more convenient and easier to implement the next reservoir volume as action *a _{t}*

*=*{

*s*

_{t+}_{1}}. In that case, the start and next discrete storage volumes

*s*and

_{t}*s*, and discretized reservoir inflow

_{t+1}*q*are defined, and the evaporation

_{t}*e*and the reservoir release

_{t}*r*can be easily calculated.

_{t}The state space grows exponentially with the additional state variables. The state space directly influences the computational time and the agent's ability to learn. However, the action space stays the same due to the ‘nested’ methodology. A third approach was used, that increased the state vector dimension to about 94,900 cells (52 weeks × 73 reservoir level × 5 reservoir inflow discretization × 5 tributary inflow discretization). Because the agent explores/exploits the possible actions over the modelling period, it is very likely that some of the *Q* (*x _{t}, a_{t}*) in the matrix will be unused. The solution selected for dealing with this issue was to use the HashMap function supported in Java. Both nSDP and nRL are developed in Java. The nRL implementation pseudo code on the case study is shown in Algorithm 4 below:

(2a) Discretize the tributary inflow into *L* intervals i.e., .

(9) Define initial state *x _{t}* with an initial reservoir volume

*s*, read the reservoir

_{t}*q*, tributary inflow cluster value ,

_{kt}*q*and

_{1t}*q*from the current

_{2t}*episode*, and the time step

*t*.

(12a) Distribute the tributary inflow using nested optimization between water demand users and calculate their remaining deficits.

(13) Execute the nested optimization with distributing the reservoir release *r _{t}* between water demand users {

*r*

_{3t}

*, r*

_{4t}

*, r*

_{5t}

*, r*

_{6t}

*, r*

_{7t}} satisfying the remaining deficits, calculate D

_{1}, D

_{2}, D

_{3}, D

_{4}, D

_{5}, D

_{6}, D

_{7}and D

_{8}, and calculate the reward

*g*(

*x*

_{t}, x_{t+}_{1}

*, a*).

_{t}Algorithm 4 steps (2a) and (12a) are added after Algorithm 2 steps (2) and (12). Steps (9) and (13) of Algorithm 4 replace the same step from Algorithm 2.

## EXPERIMENTAL SETTINGS

The available 55 years' weekly data are separated into two parts: (1) training and (2) testing. The data from 1951 to 1994 (2,340 time steps) are used for training and 1994–2004 (520 time steps) for testing. The nSDP training data consist of reservoir inflow *q _{t}* and tributary inflow in the previously mentioned period. The nRL training data consist of reservoir

*q*and tributary inflow , and the two other flows

_{t}*q*and

_{1t}*q*are used for hydropower calculation. The nSDP and nRL data for minimum and maximum levels, water supply, irrigation demands, ecological flow and hydropower are set to the 2005 weekly data presented in the case study section, and they are the same in the training and testing periods. The reservoir operation volume is discretized in 73 equal levels (300 × 10

_{2t}^{3}m

^{3}). The minimum level was set at 1,021.5 m amsl and the maximum level at 1,060 m amsl. The weights applied in these experiments are shown in Table 1. At the beginning, the nested optimization algorithm (linear or quadratic) and the number of clusters (in our case five) are selected in both nSDP and nRL. Both nSDP and nRL have the same experimental settings.

Experiments | w_{1} | w_{2} | w_{3} | w_{4} | w_{5} | w_{6} | w_{7} | w_{8} |
---|---|---|---|---|---|---|---|---|

nDP-L_{5,} nDP-Q_{5}; nSDP-L_{5}, nSDP-Q_{5}; nRL-L_{5}, nRL-Q_{5} | 2,000,000 | 2,000,000 | 200 | 1 | 200 | 1 | 300 | 0.01 |

Experiments | w_{1} | w_{2} | w_{3} | w_{4} | w_{5} | w_{6} | w_{7} | w_{8} |
---|---|---|---|---|---|---|---|---|

nDP-L_{5,} nDP-Q_{5}; nSDP-L_{5}, nSDP-Q_{5}; nRL-L_{5}, nRL-Q_{5} | 2,000,000 | 2,000,000 | 200 | 1 | 200 | 1 | 300 | 0.01 |

The main OF combines three distinct objective types: the minimum and maximum reservoir critical levels that are measured in m, the water user demands that are measured in 10^{3} m^{3}/per time step (week or month) and the hydropower energy production that is measured in MWh/per time step (week or month)*.* This is the main reason why the weights have different magnitudes. A similar approach is taken in other previous research studies (e.g., Pianosi & Soncini-Sessa 2009; Rieker 2010; Quach 2011).

The weights are set according to the objective importance and create the ORO policy. The most important objective is the environmental flow (*w _{7}*) followed by cities' water demands (

*w*and

_{3}*w*), agriculture demands (

_{5}*w*and

_{4}*w*), and lastly hydropower production (

_{6}*w*). The hydropower weights were set extremely low for two main reasons: (1) the hydropower objective by the reports is considered as a by-product from reservoir operation and not its main feature; and (2) to lower as much as possible the influence of hydropower in ORO, and have a valid ground in comparing nSDP and nRL.

_{8}The nDP results of testing data are ‘the optimal operation’, meaning that nDP is a deterministic optimization and calculates the ORO. The nSDP and nRL, on the other hand, are trained on training data, producing a policy, and have not seen the testing data. The nDP results are used as a benchmark for the nSDP and nRL policies. The closer the policies derived by nSDP and nRL are to nDP, the better they are.

The algorithms are additionally labelled to denote the deficit formulations used in the nOAA. For example, nDP-L_{5} stands for nDP using the linear deficits' formulation, and nDP-Q_{5} stands for nDP using the quadratic deficits' formulation. The nRL parameters at the beginning are set at: *α*_{0} = 0.8, *γ* = 0.5 and *ɛ* = 0.8. The parameter *α* is set to decrease linearly with the number of episodes. The maximum number of episodes is set to *M* = 400,000. There were various approaches tested for decreasing *ε*, and the one used in the experiments is decreasing *ε* on each 100,000 episodes by half. Starting at *ε*_{0} = 0.8 and with increasing the number of episodes the agent is making less exploration and more exploitation actions, and insuring convergence to the optimal solutions.

## RESULTS AND DISCUSSION

_{5}agent learns the optimal policy and gets closer to nDP-L

_{5}ORO as the number of episodes increases, as shown in Figure 3. However, after a large number of episodes, the learning slightly deteriorates. The nRL-L

_{5}agent optimal reservoir policy is poor at 30,000 episodes, as shown in Figure 3. From 50,000 to 80,000 episodes, the nRL-L

_{5}policy improves, and between 80,000 and 160,000, the policy is the closest to the nDP-L

_{5}ORO. After 250,000 episodes the policy slightly deteriorates. The possible reason for this is overtraining, which is a known issue when using machine learning algorithms.

_{5}reservoir volume at time

*t*and is the reservoir volume of nRL-L

_{5}at time

*t*.

The absolute difference between the nRL-L_{5} and nDP-L_{5} optimal reservoir volumes can be used as the stopping criterion. Obviously, the nRL-L_{5} optimal reservoir policy performs best between 80,000 and 160,000 episodes of training. Afterwards, the policy somewhat deteriorates, although it is still relatively good.

_{5}has a very limited minimum level violation, while nSDP-L

_{5}and nRL-L

_{5}have significant minimum level violation, as shown in Figure 4. The reason for this behaviour is that nDP-L

_{5}is a deterministic optimal algorithm that has perfect knowledge (forecast) of the future. When nDP-L

_{5}is applied to the 1994–2004 testing data, it means that an exhaustive search is performed to find the optimal deterministic solution. Therefore, nDP-L

_{5}as shown in Figure 4, performs an optimization policy in the period 1999–2001 that is leaving the reservoir on the minimum level only for a very short period of time. On the other side, both nSDP-L

_{5}and nRL-L

_{5}train/learn on training data and produce policies. The nSDP-L

_{5}and nRL-L

_{5}policies are assessed on the testing data. The training data 1951–1994 are the combination of wet, average and dry years that are fed to nSDP-L

_{5}and nRL-L

_{5}algorithms. Based on these data, both algorithms derive the optimal policy, which is a universal 1-year policy (for wet, average and dry years). Due to that, the nSDP-L

_{5}and nRL-L

_{5}performance in the period 1999–2001 leaves the reservoir empty for a longer period, and it is noticeable that nRL-L

_{5}performs better than nSDP-L

_{5}. The overspills are considered and calculated and can be partially observed in the periods May 1999 and 2000 in Figure 4. The nDP-Q

_{5}, nSDP-Q

_{5}and nRL-Q

_{5}results are similar to the nDP-L

_{5}, nSDP-L

_{5}and nRL-L

_{5}correspondingly.

_{5}, nSDP-L

_{5}, nRL-L

_{5}, nDP-Q

_{5}, nSDP-Q

_{5}and nRL-Q

_{5}optimization results and comparison of the sum of minimum level (D

_{1}) and maximum level (D

_{2}) deviations, sum of users' deficit (D

_{3-7}) and sum of hydropower deficit (

*D*) in the testing period 1994–2004 are shown in Figure 5.

_{8}Figure 5 results show that nDP-L_{5} and nDP-Q_{5}, which are the target, have very low D_{1} and D_{2} deviation, and that nRL-L_{5} and nRL-Q_{5} are better than nSDP-L_{5} and nSDP-Q_{5}. The same applies in D_{3}–D_{7} deviations, while in D_{8} nSDP-L_{5} and nSDP-Q_{5} are not calculated. The nRL-L_{5} and nRL-Q_{5} produce better ORO policies than the nSDP-L_{5} and nSDP-Q_{5}, as shown in Figure 5.

The overall results shown in Figure 5 correspond to the results shown in Figure 4, where it is obvious that the nRL policy performs better than nSDP, and it is closer to the target set by nDP.

## CONCLUSIONS

The paper presented nSDP and nRL novel ORO algorithms that can solve problems with multiple decision variables, successfully alleviating the curse of dimensionality. These algorithms were implemented and tested in the case of the Zletovica hydro-system with eight objectives and six decision variables.

The nSDP has issues in implementing several state variables without provoking the curse of dimensionality, thus adjustments were needed to fit nSDP to the case study optimization problem requirement. The nRL showed its true power with including all four stochastic variables implementing the complete optimization problem formulation, but its implementation and tuning requires additional effort. The main conclusion from the implementation of the algorithms is that nDP can implement complex optimization problem formulations without significant problems. The nSDP has limitations when additional optimization problem variables are included. The nRL is very powerful in implementing complex optimization problems, but needs tuning concerning its design, parameters, action list, convergence criteria, etc.

The presented nSDP and nRL and their implementation in the case study of the Zletovica river basin confirmed that in some situations the curse of dimensionality and computational complexity can be overcome. There could be situations where, for example, in applying nSDP and the correlation between the reservoir and tributary inflow is low, this proposed solution is not applicable. This restricts the possibility of nSDP application to a subset of reservoir problems. In this particular case, the two stochastic variables' approximation in nSDP was the best approach. In any case, the nSDP algorithm is limited in few stochastic variables before breaking the curse of dimensionality and becoming computationally unsolvable. On the other hand, nRL demonstrated its full capacity in including multiple stochastic variables and solving this problem and, at least on the conceptual level, can be applied to much more complex single and multireservoir problems.

The nSDP and nRL were used to derive 1-year weekly optimal reservoir policy. The available weekly data (1951–2004) were divided into a training (1951–1994) and testing part (1994–2004). The nSDP and nRL optimized/learned the optimal reservoir policy on training data, and their policy was examined on the testing data. The nDP solved the ORO problem in the testing period (1994–2004) and this solution was used as a target for both nSDP and nRL. Interesting results were to observe how the nRL agent learns with the increase of the number of episodes. The nRL optimal reservoir policy is best between 80,000 and 160,000 learning episodes. The nSDP and nRL policies were benchmarked against the nDP results and it was found that the nRL performs better than nSDP overall and for all objectives separately. The main conclusion is that the nRL is a better choice than the nSDP, at least for the considered case study.

The presented nSDP and nRL algorithms are successfully tested on a relatively complex single ORO problem, as the Zletovica case study is. Generally, nSDP cannot be applied in two and more multireservoir systems because of the curse of dimensionality. On the other hand, nRL supports several stochastic variables, as demonstrated in this case study, and in our opinion could be a potential solution for multireservoir ORO problems. However, as stated previously, the nRL implementation is difficult and highly problem specific.

The developed nested algorithms are computationally efficient and can be run on standard personal computers. For the considered case study, on a standard PC, nDP executes in 1–3 min, nSDP in 2–5 min, while nRL is 8–15 min (the longest is nRL-Q).

The ORO is a multi-objective problem by its nature because often different objectives (water demands, hydropower and reservoir levels) are concerned. In this research, it was first reduced to a single objective optimization problem by employing the single-objective aggregated weighted sum function. It is possible to make several single-objective optimization algorithms that are executed multiple times with several weight sets, i.e., multi-objective optimization by a sequence of single-objective optimization searches. This method can be applied to nDP, nSDP and nRL, which will create fully fledged multi-objective algorithms. Future research will focus on designing and developing multi-objective variants on the nDP, nSDP and nRL producing Pareto set.