Stormwater basins are important stormwater control measures that can reduce peak flow rates, mitigate flooding volumes, and improve water quality during heavy rainfall events. Previous control strategies for stormwater basins have typically treated water quality and quantity as separate objectives. With increasing urban runoff caused by climate change and urbanization, current single-objective control strategies cannot fully harness the control potential of basins and therefore require improvement. However, designing multi-objective control strategies for basins is challenging because of conflicting operational goals and the complexity of dynamic environmental conditions. This research proposes a novel real-time control strategy based on deep reinforcement learning to address these challenges. It employs a deep Q-network to develop an agent capable of making control decisions. After being trained on three different rainfall events, the reinforcement learning agent can make appropriate decisions for previously unseen rainfall events. Compared to two rule-based control scenarios and a static state scenario, the deep reinforcement learning method is more effective in reducing total suspended solids, reducing peak flow, minimizing outflow flashiness, and limiting control effort, striking a good balance between conflicting control objectives.

  • Deep reinforcement learning is effective in stormwater basin control with multiple objectives.

  • There is no direct correlation between training reward and performance.

  • A well-trained agent can adjust to unseen rainfall events.

A stormwater basin, or stormwater detention pond, is a man-made facility designed to capture and manage the excess water generated during heavy rainfall. As surface storage facilities, basins serve as a crucial tool for controlling stormwater runoff through attenuation and purification. They temporarily hold stormwater during rainfall, allowing water to infiltrate slowly into the ground or be released back into the environment at a controlled rate, thereby reducing peak flows and flood volumes (Griffiths 2017). Additionally, these basins promote the deposition of particulate pollutants, thereby improving stormwater quality (Haan et al. 1994; Alam et al. 2018). Consequently, they represent a viable solution to the multitude of challenges posed by urban stormwater management. Typically, real-time control (RTC) devices are integrated at the outlets of basins, enabling them to adjust the water discharge rate in response to external conditions and thereby significantly enhancing control effectiveness.

Although basins offer multiple control capabilities, previous basin control has usually employed single-objective strategies. Some studies focused solely on water quantity control, aiming to minimize overflow, mitigate peak flows, or decrease flow velocities downstream (Park et al. 2014; Schmitt et al. 2020). Others concentrated on water quality improvement, targeting reductions in total suspended solids (TSS) or other pollutants (Middleton & Barrett 2008; Lacour et al. 2011; Gaborit et al. 2013). While some of these control objectives conflict (for instance, lowering pollutant concentrations requires closing the valve so that solids can settle, which raises water levels and increases the risk of overflow in the basin), others can be reconciled (e.g., reducing peak flow often enhances pollutant removal efficiency). Consequently, in most cases, a single-objective control approach is not the optimal solution for overall control effectiveness, implying that it does not fully maximize the basin's potential.

In light of climate change, extreme precipitation events are becoming more frequent. At the same time, escalating urbanization drives an expansion of impermeable surfaces across urban landscapes. As a result of the combined influence of climate change and urbanization, the magnitude of urban stormwater runoff is increasing. This growing runoff not only releases substantial volumes of pollutants into waterways (Brombach et al. 2005), but also stands as a primary culprit behind urban flooding, erosion of water bodies, rapid peak flows, and hydraulic disturbances in receiving streams (Borchardt & Sperling 1997). Given these mounting challenges, basins, as essential elements of urban stormwater management, need to shift from single-objective control to multi-objective control to manage the escalating runoff pressures effectively.

Transitioning from single-objective to multi-objective control demands more sophisticated control strategies. Rule-based control strategies (e.g., if the water level exceeds a certain threshold, adjust the outlet to a predefined setting) are commonly applied to RTC basins with a single objective. These rules are designed from human experience and are stable and easy to implement, but they have several disadvantages. First, rule-based control algorithms are not dynamic enough to adapt to various rainfall events. Second, the control rules are generalized from human experience, which provides no guarantee of optimality. Third, it is very difficult to design suitable rules for basins with competing control objectives, which is common in the real world (Zhang et al. 2023). Thus, the rule-based control method is not suitable for multi-objective control of basins.

To address these limitations, this study explores the application of deep reinforcement learning (DRL) as a novel control strategy. DRL is a category of machine learning that learns from trial-and-error experience through interaction with the environment (Fayaz et al. 2021). With DRL, control strategies can be learned rather than predetermined as rules (Sutton & Barto 2018). DRL can be combined with neural networks as function approximators, which increases the flexibility to optimize control actions and offers the potential to continually adapt operations to evolving environmental conditions (Lillicrap et al. 2015). With the advances of deep learning, DRL has successfully automated a variety of complex and human-level tasks, such as the game of Go (Silver et al. 2017), StarCraft II (Vinyals et al. 2019), matrix multiplication (Fawzi et al. 2022), optimal path finding for multi-arm manipulators (Suleiman et al. 2020), combined sewer control (Zhang et al. 2023), and building energy control (Wang & Hong 2020).

Researchers in the stormwater management field have also applied DRL to generate control strategies for urban drainage systems and achieved desirable outcomes (Mullapudi et al. 2020, 2018; Tian et al. 2022, 2023a, 2023b; Zhang et al. 2022). However, their control objectives are relatively simple, considering only flood mitigation and peak flow reduction without accounting for water quality or control effort. Because these objectives are largely compatible (flow velocity typically scales with water depth), such a multi-objective problem remains manageable. Nevertheless, as urban stormwater runoff escalates, the importance of incorporating objectives such as pollution reduction and the mitigation of flashy outflows cannot be overlooked. The increase in control objectives introduces a new level of complexity, leaving the suitability of reinforcement learning (RL) for tackling such intricate tasks an open question that warrants further investigation. In addition, compared to similar engineering fields such as building heating, ventilation, and air conditioning control and electric grid control, the potential of DRL for stormwater basin control remains poorly explored. It is unclear how DRL will perform for the multi-objective control of stormwater basins and how its performance can be improved effectively.

This study applies DRL to generate multi-objective control strategies for a stormwater basin in a real-world watershed, aiming to maximize its pollutant removal efficiency, overflow mitigation, and peak flow reduction while minimizing control effort and outflow flashiness. The agent is trained on a 12-h 2-year return period storm, a 5-year return period storm, and a 20-year return period storm. It is then tested on a 12-h 10-year return period storm and a recorded real rainfall event.

This section describes the development, execution, and assessment of a DRL control strategy for managing multiple objectives in stormwater basins. The approach begins with a physics-based model that simulates the hydrodynamics and pollutant removal in the basin using the stormwater management model (SWMM). A DRL agent that can interact with SWMM is then created in Python. The DRL agent is subsequently trained on synthetic rainfall data to ensure its adaptability to diverse scenarios, and the trained agent is then subjected to rigorous testing with both newly designed and real-world rainfall events. Comparative evaluations are conducted against three alternative control methods: water level-based control, pollution-focused control, and a static state strategy, to demonstrate the efficacy of the DRL approach.
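As an illustration of this coupling, the sketch below steps a SWMM model from Python using the pyswmm package and reads the basin state at each 5-min step; the input file name and the element IDs ('basin', 'orifice') are placeholders, and the fixed orifice setting merely stands in for the agent's decision.

```python
# Sketch of the Python-SWMM coupling using pyswmm; the input file and the
# element IDs 'basin' and 'orifice' are placeholders for the study model.
from pyswmm import Simulation, Nodes, Links

with Simulation('watershed_model.inp') as sim:
    basin = Nodes(sim)['basin']       # storage node representing the basin
    orifice = Links(sim)['orifice']   # controllable outlet orifice
    sim.step_advance(300)             # expose the state every 5 min of simulated time

    for _ in sim:
        # State variables an agent could observe at this time step
        state = (basin.depth, basin.total_inflow, orifice.flow)
        # A trained agent would map this state to one of the 11 opening
        # percentages; a fixed 50% setting stands in for that decision here.
        orifice.target_setting = 0.5
```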

Study area and rainfall events

This study is carried out in an urban watershed located in Michigan, USA. The watershed is approximately 4 km2 in area and contains several stormwater basins. Figure 1 shows an overview of this area and the studied basin. This area was chosen because it contains 11 stormwater basins, which play a crucial role in local stormwater management, and because a calibrated model of the area already exists and has been used and verified in several published studies. The study's focal point is the central basin, which is flanked by upstream and downstream components. The basin has a depth of 4.18 m and a total volume of approximately 659 m3, and is equipped with a controllable outlet at its base to discharge water downstream. SWMM is employed to simulate the hydraulic and pollutant removal processes within the basin. It is a popular hydrologic-hydraulic computational model that has been successfully used in the planning, analysis, and design of urban drainage systems (Avellaneda et al. n.d.; Cantone & Schmidt n.d.). For this research, model parameters are adopted from a calibrated model described in a previous study (Bartos et al. 2018).
Figure 1

Overview of the stormwater system and studied basin.
The rainfall data used for training and testing in this study are designed based on the Chicago design storm (Chicago hyetograph) method. The intensity function is as follows:
(1)
The study employs three 12-h storms with return periods of 2 years, 5 years, and 20 years for training, and tests with a 10-year return period storm and a genuine recorded rainfall event. The corresponding hyetographs are depicted in Figure 2.
Figure 2

Hyetograph of the training and testing rainfall.
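For illustration, the sketch below generates a Chicago-type (Keifer-Chu) design hyetograph from an intensity-duration-frequency relationship of the form i = a/(t + b)^c; the IDF parameters and the peak-position ratio r are placeholders, not the values used to derive the storms in Figure 2.

```python
# Hedged sketch of a Chicago (Keifer-Chu) design hyetograph generator.
# The IDF parameters a, b, c and the peak-position ratio r are illustrative
# placeholders, not the values used in the study.
import numpy as np

def chicago_hyetograph(duration_min=720, dt_min=5, a=2000.0, b=10.0, c=0.8, r=0.4):
    """Return rainfall intensities at dt_min intervals for a storm of
    duration_min minutes, assuming an IDF curve i = a / (t + b)**c."""
    t = np.arange(dt_min / 2, duration_min, dt_min)   # mid-points of each interval
    t_peak = r * duration_min
    # Time measured away from the peak, rescaled for the rising and falling limbs
    tau = np.where(t < t_peak, (t_peak - t) / r, (t - t_peak) / (1 - r))
    return a * ((1 - c) * tau + b) / (tau + b) ** (1 + c)

intensity = chicago_hyetograph()   # 12-h storm sampled every 5 min
```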

Control objectives and evaluation metrics

This area has traditionally served as a major focal point in the city's strategy to combat flooding and reduce runoff-driven water quality impairments (Bartos et al. 2018). With rising stormwater flows, enhancing multi-objective stormwater management in the area is crucial for sustainable water use and environmental protection. In this study, the control objectives of the basin are:

  • (1) Maximize the detention time of the captured water to improve pollutant removal.

  • (2) Minimize overflow volume to mitigate flooding.

  • (3) Minimize peak outflow to reduce erosion in downstream riverbeds.

  • (4) Discharge the basin in a smooth way to minimize the hydraulic shocks induced to receiving water bodies and to minimize re-suspension.

  • (5) Minimize the frequency of orifice operations in order to prevent excessive wear and tear on the actuator.

The corresponding performance evaluation metrics are (a) overflow volume, (b) peak flow rate, (c) downstream cumulative TSS load, (d) control effort, and (e) outflow flashiness. These metrics are defined in Table 1.

Table 1

Performance evaluation metrics and quantitative performance measures

Performance criteria    Quantitative performance measure
Overflow                Cumulative overflow volume over the event
Peak outflow            Maximum outflow rate, max(Q_out)
Cumulative TSS load     Cumulative TSS mass discharged downstream over the event
Control effort          Number of orifice setting changes during the event
Outflow flashiness      Cumulative absolute change in outflow between successive time steps
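To make these metrics concrete, the sketch below shows one plausible way to compute them from simulated time series; the array names, units, and exact formulations are illustrative and may differ from the definitions used in the study.

```python
# One plausible way to compute the Table 1 metrics from 5-min simulation
# series; formulations are illustrative, not the study's exact definitions.
import numpy as np

def evaluate(q_overflow, q_out, tss_out, setting, dt=300.0):
    """q_overflow, q_out in m3/s; tss_out in mg/L; setting is the orifice
    opening ratio per step; dt is the time step in seconds."""
    overflow_volume = np.sum(q_overflow) * dt                  # m3
    peak_outflow = np.max(q_out)                               # m3/s
    tss_load = np.sum(tss_out * q_out) * dt / 1000.0           # mg/L * m3/s = g/s, summed to kg
    control_effort = int(np.count_nonzero(np.diff(setting)))   # number of setting changes
    flashiness = np.sum(np.abs(np.diff(q_out)))                # cumulative outflow change
    return overflow_volume, peak_outflow, tss_load, control_effort, flashiness
```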

Formulation as reinforcement learning problem

The general idea of DRL control is to train an agent through interacting with an environment. By taking certain actions and receiving rewards or penalties in return, the agent learns how to behave better in a certain environment (Wang et al. 2021). The learning process is as follows: firstly, the agent initializes its knowledge (often randomly). It then interacts with the environment, observes the outcomes, and updates its knowledge based on the rewards it receives. The agent uses this knowledge to improve its policy over time. There are several important elements of RL, including agent, environment, states, actions, rewards, policy, value functions, exploration, and exploitation.

In this study, the control object is the outlet orifice, which has a square cross-section with a side length of 0.305 m. The orifice offers 11 opening positions, ranging from 0 to 100% in 10% increments. The outlet flow rate depends on the water level and the opening percentage, as follows:
\[ Q = \varphi\, C_d\, A \sqrt{2 g \Delta h} \qquad (2) \]

where \(Q\) is the outlet flow rate; \(\varphi\) is the opening percentage of the orifice; \(C_d\) is the orifice discharge coefficient; \(A\) is the cross-sectional area of the orifice; \(g\) is the gravitational acceleration; and \(\Delta h\) is the height difference between the orifice and the water surface.
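As a quick numerical illustration of Equation (2), the sketch below evaluates the orifice outflow for a given opening ratio and head; the discharge coefficient of 0.65 is a typical textbook value used only for illustration, not the calibrated value from the study model.

```python
# Direct transcription of Equation (2); C_d = 0.65 is an illustrative value.
import math

def orifice_outflow(opening, head, side=0.305, c_d=0.65, g=9.81):
    """Outflow (m3/s) through the square orifice for an opening ratio (0-1)
    and a head (m) of water above the orifice."""
    area = side ** 2
    return opening * c_d * area * math.sqrt(2.0 * g * max(head, 0.0))

print(orifice_outflow(opening=0.5, head=2.0))  # e.g., half open under 2 m of head
```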

The environment is the SWMM model, which provides state information at a 5-min simulation time step. The state space includes the current depth (m), current inflow (m3/s), current outflow (m3/s), current TSS concentration in the basin (mg/L), current inflow TSS concentration (mg/L), current opening percentage of the controllable orifice, and current rainfall intensity (mm). The action space of the agent is the opening percentage of the outlet orifice. The reward function is designed to reflect how well the agent meets the user-specified control objectives, such as avoiding overflow and maximizing pollutant removal. It is set as follows:
(3)
where the terms include the opening ratio of the orifice at time t and the reward weights corresponding to each control objective. Avoiding overflow and reducing pollutant load are the two most important objectives for a stormwater basin; thus, larger weight parameters have been assigned to these two objectives.

If the basin's water level decreases, a positive reward is granted; conversely, negative rewards are assigned when the water level rises, when control actions are taken, and when pollution levels increase. Overall, all of these parameter settings are based on multiple trials, and the current parameters allow the agent to converge during training and learn effective control strategies.
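Since the exact weights of Equation (3) are tuned values specific to this study, the sketch below only illustrates the weighted-sum structure of such a reward, with placeholder weights and simplified terms.

```python
# Illustrative weighted-sum reward; the weights and terms are placeholders,
# not the tuned values of Equation (3).
def reward(depth_change, overflow_volume, basin_tss, setting_changed,
           w_overflow=5.0, w_tss=5.0, w_depth=1.0, w_control=0.5):
    r = -w_overflow * overflow_volume        # heavily penalize any overflow
    r -= w_tss * basin_tss                   # penalize pollutants remaining in the basin
    r -= w_depth * depth_change              # reward falling (penalize rising) water levels
    r -= w_control * float(setting_changed)  # penalize changing the orifice setting
    return r
```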

A policy is a mapping from an agent's state or a sequence of states to an action. It represents the agent's decision-making strategy. The policy determines what action the agent should take in a given situation to maximize its long-term reward. A value function estimates the long-term reward an agent can expect to receive by following a particular policy. Exploration is the process by which an agent learns about the environment by taking actions that may lead to uncertain or unexplored outcomes. The goal of exploration is to gather information about the environment to improve the agent's policy. Exploration strategies often involve a balance between choosing actions based on the current knowledge (exploitation) and taking random or seemingly suboptimal actions to discover new information. Exploitation is the act of choosing the action that appears to be the best, based on the current knowledge and the agent's current policy. It is the process of maximizing the immediate reward by taking the action that is believed to yield the highest reward given the current state (François-Lavet et al. 2018; Sutton & Barto 2018; Fayaz et al. 2021). However, exploitation can lead to local optima if the agent does not explore enough.
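As a concrete illustration of this exploration-exploitation balance, the following sketch implements ε-greedy action selection over the 11 discrete orifice settings, using the ε = 0.1 listed in Table 2; the function itself is illustrative rather than the study's implementation.

```python
# Epsilon-greedy action selection over the 11 orifice settings (epsilon = 0.1
# as in Table 2); illustrative only.
import random

def select_action(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```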

The fundamental process of RL can be outlined as follows: (1) the agent interacts with the environment, observing its current state. (2) The environment provides a scalar reward signal as feedback in response to the agent's action. (3) The agent employs a deep Q-learning (DQN) algorithm to update its Q-value function based on the Bellman equation which represents the expected future rewards for each state-action pair. The agent then follows an optimal policy (a sequence of actions) to maximize the cumulative reward (Dolcetta & Ishii 1984; Hansen 2016). (4) To mitigate correlation in the training data and stabilize the learning process, experience replay and a target network are employed. Experience replay is a technique which stores past experiences in a buffer and samples them randomly for training. This breaks the correlation between consecutive experiences and helps the model generalize better, improving learning stability by preventing overfitting to recent data. The target network is a delayed copy of the main Q-network, which is updated less frequently than the primary network. This separation provides a more stable and consistent target for updating the Q-values, avoiding the rapid and potentially destabilizing fluctuations that can occur when the Q-values are updated too frequently. By using a target network, the learning process is smoothed, preventing instability in the Q-value updates and ensuring more reliable convergence (Hansen 2016). (5) This process is iteratively repeated with the agent continuously updating its policy based on the learned Q-values and aiming to maximize the cumulative reward over time. (6) The learning terminates once a convergence criterion is met, indicating that the agent's performance has stabilized and it has learned an optimal or near-optimal control policy. The entire learning process is depicted in Figure 3.
Figure 3

RL flowchart.
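A compact sketch of steps (3) and (4) is given below, assuming PyTorch, a replay buffer of (state, action, reward, next state) tuples, and the batch size and discount rate from Table 2; the tensor handling, the loss and optimizer calls, and the omission of terminal-state masking are simplifications rather than the authors' exact implementation.

```python
# Sketch of one DQN update with experience replay and a target network,
# assuming PyTorch; batch size and discount rate follow Table 2.
import random
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, replay_buffer,
               batch_size=64, gamma=0.9):
    # Sample a random minibatch of (state, action, reward, next_state) tuples
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)

    # Q(s, a) predicted by the online network
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target computed with the slowly updated target network
    with torch.no_grad():
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(q_net, target_net):
    # Periodically copy the online weights into the target network (step 4)
    target_net.load_state_dict(q_net.state_dict())
```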

Hyperparameter settings

In RL, hyperparameters are parameters that are not learned during the training process but are set by the researcher or developer before the learning algorithm begins (François-Lavet et al. 2018). These values influence the behavior and performance of the learning algorithm and tuning them is crucial for achieving optimal results.

The hyperparameters for the DQN agent are shown in Table 2 (Mnih et al. 2015). These values were determined through a combination of empirical testing and references from related literature. Regarding the network structure, 6 represents the number of input nodes, corresponding to the 6 input features (i.e., the current depth, current inflow, current outflow, current TSS concentration in the basin, current inflow TSS concentration, and current opening percentage of the controllable orifice) used by the agent. The hidden layer consists of 20 neurons, which was chosen to provide sufficient complexity for the model to capture underlying patterns in the data without overfitting. Finally, 11 represents the number of output nodes, corresponding to the 11 possible actions (i.e., 0, 10, 20, …, 100% opening percentage) the agent can take. This structure was selected after testing different configurations to ensure that the model has enough capacity to learn effectively while maintaining efficiency in training and decision-making. The learning rate of 0.001 means that the agent integrates merely 0.1% of new information into its learning process at each update. This rate was chosen to balance the need for slow, stable learning while avoiding large updates that could destabilize the agent. The discount rate γ of 0.9 enables the agent to consider future rewards while maintaining a focus on immediate outcomes. This value was optimized through trials to account for long-term rewards without overwhelming short-term learning. The ε-greedy rate of 0.1 means the agent will explore new actions 10% of the time, favoring exploitation of known actions 90% of the time, facilitating a balance between learning from new experiences and leveraging existing knowledge. The replay memory size and batch size were set based on tests aimed at allowing the agent to recall sufficient past experiences while maintaining training efficiency. The agent learned every five time steps, a value chosen to reduce computational load and optimize learning time.

Table 2

Hyperparameter settings

Hyperparameters            Value
Network structure          6 × 20 × 11
Learning rate              0.001
Discount rate (γ)          0.9
ε-greedy rate (ε)          0.1
Replay memory size (N)     1,000
Batch size                 64
Learning interval          5 time steps
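For reference, a Q-network matching the 6 × 20 × 11 structure in Table 2 could look like the sketch below; the ReLU activation and the choice of the Adam optimizer are assumptions, as the paper does not specify them.

```python
# A Q-network matching the 6 x 20 x 11 structure in Table 2 (PyTorch sketch);
# the hidden activation and the Adam optimizer are assumptions.
import torch.nn as nn
import torch.optim as optim

def build_q_net():
    return nn.Sequential(
        nn.Linear(6, 20),    # 6 state features
        nn.ReLU(),
        nn.Linear(20, 11),   # one Q-value per orifice opening (0, 10, ..., 100%)
    )

q_net = build_q_net()
target_net = build_q_net()
target_net.load_state_dict(q_net.state_dict())        # start with identical weights
optimizer = optim.Adam(q_net.parameters(), lr=0.001)   # learning rate from Table 2
```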

In this study, a DQN serves as the function approximator for the value function. The neural network takes the state of the environment as input and processes it through multiple layers, allowing it to capture complex patterns and non-linear relationships between states and actions. This deep neural network is capable of generalizing across different states and providing an estimate of the optimal Q-value for each possible action. During the training phase, the approximator learns to minimize the difference between its predicted Q-values and the target Q-values, which are calculated based on the agent's current policy and the Bellman equation. This process of adjusting the network's weights through backpropagation enables the approximator to continually refine its estimates and improve the agent's decision-making in complex, sequential decision problems.

Control scenarios

This study sets up four control scenarios to compare control performance. The first is a static state scenario, in which the outlet is kept fully open at all times. The second is water level-based control, which keeps the outlet closed until the water level in the basin exceeds a certain point: if the water level exceeds 2 m, the valve fully opens; if the water level is between 1 and 2 m and there is no inflow, the valve is half-opened; and if the water level is less than 1 m and there is no inflow, the valve is closed. The third is pollution (TSS) concentration-based control: if the TSS concentration exceeds 15 mg/L and the water level exceeds 2.5 m, the valve closes; otherwise, it is fully open. The last is the DRL control scenario, which uses the trained agent to control the outlet orifice. These scenarios are tested on a 10-year return period rainfall event as well as a recorded real rainfall event.
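The two rule-based scenarios can be written out as simple functions, as sketched below; the fall-through case for the water level-based rule (keeping the previous setting when none of the stated conditions apply) is an assumption, since the text does not cover water levels below 2 m with non-zero inflow.

```python
# The two rule-based scenarios written as functions returning the orifice
# opening ratio (0-1); the fall-through case of the water level rule is an
# assumption not stated in the text.
def water_level_control(depth_m, inflow_m3s, prev_setting):
    if depth_m > 2.0:
        return 1.0                                   # fully open above 2 m
    if inflow_m3s == 0 and 1.0 <= depth_m <= 2.0:
        return 0.5                                   # half open, no inflow
    if inflow_m3s == 0 and depth_m < 1.0:
        return 0.0                                   # closed, no inflow
    return prev_setting                              # assumed: keep previous setting

def tss_control(tss_mg_l, depth_m):
    return 0.0 if (tss_mg_l > 15.0 and depth_m > 2.5) else 1.0
```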

The selection of these three control scenarios is based on their prevalence as common control strategies, supported by numerous prior research studies (Gaborit et al. 2013; Shishegar et al. 2021, 2019). By comparing these scenarios, the performance of RL control can be clearly demonstrated, highlighting both its strengths and limitations. However, it is important to note that other advanced rule-based control strategies may exist that could potentially outperform those selected for this comparative analysis.

Once the simulation model and the DRL agent were established and integrated, the training and testing processes were carried out, followed by a performance comparison of the four control scenarios described above.

Training result

The DRL agent is trained on three different rainfall events: a 12-h 2-year return period storm, a 5-year return period storm, and a 20-year return period storm. The number of training episodes is set to 100 based on experimental observations. As training progresses, the reward increases and gradually stabilizes, indicating that the algorithm has converged within 100 episodes. The reward progression is illustrated in Figure 4. Initially, the reward curve exhibits fluctuations. Around the 20th episode, the curve begins to exhibit a clear upward trend, reaching its peak around the 90th episode. This indicates that the agent, after extensive exploration, has discovered a suitable optimization direction. By the 90th episode, the reward curve levels off, indicating that the strategy has become relatively well established and mature.
Figure 4

Reward training curve of DQN.

Test results

Figures 5 and 7 show how the static state system, water level-based control, TSS-based control, and RL control respond to a 10-year designed rainfall and a recorded real rainfall. The performance of the four control strategies is evaluated in terms of preventing overflow, reducing pollution, reducing shock to the downstream river, and minimizing peak flow, while simultaneously minimizing control effort. In this study, all evaluation metric values are calculated as a relative percentage of the maximum value for each metric. Note that a smaller area represents better performance.
Figure 5

Basin dynamics results of the designed storm.
Compared to the static control strategy, the other three RTC strategies utilize the basin's capacity more effectively by allowing the basin to fill closer to its maximum height during both the designed storm and the real storm. However, as shown in Figures 5 and 7, the two rule-based control strategies struggle to decrease the outflow rate and minimize valve control effort. This is especially true for the water level-based control strategy, in which the valve setting switches automatically whenever the water level crosses a threshold, leading to frequent valve oscillations over time, as shown by the green dashed line in the subgraph 'OR'. This continuous cycling can result in significant fatigue damage to the valves. In addition, this control strategy results in significant fluctuations in downstream flow velocities. Although it successfully keeps water levels low for peak flood mitigation, the large valve openings result in exceptionally high peak flow rates. As evident from Figures 6 and 8, the total area enclosed by the performance curve for water level-based control is the largest, indicating its inferior performance even compared to the static state scenario.
Figure 6

Relative performance (%) comparisons of the designed storm.

Figure 7

Basin dynamics results of the real storm.

Figure 8

Relative performance (%) comparisons of the real storm.

In the subgraph 'outflow' of Figures 5 and 7, the TSS-based control method generates the highest peak flow due to its operational principle, which keeps the valve closed until TSS concentrations drop below a threshold. This leads to elevated basin water levels, resulting in a significant peak flow upon valve re-opening. It can be inferred that a more gradual valve opening strategy once the TSS concentration falls below the threshold would help manage peak flow more effectively. As Figure 8 illustrates, the TSS control strategy performs better on several metrics, such as control effort and overflow volume, with its overall area ranked second only to the RL approach.

In this research, we selected 100 training episodes for the DRL agent. To test the robustness of the agent's learning process, we also evaluated control performance after 20 and 50 training episodes. Interestingly, the control results after 20, 50, and 100 episodes were identical, suggesting that the critical behavioral patterns were learned relatively early in the training process. This observation indicates that although the total reward continues to increase with more episodes, the agent's performance stabilizes after a certain point. This is a significant finding, as training time is a key factor limiting the practical implementation of DRL in real-world projects. By reducing the number of required training episodes, this research presents the possibility of significantly shortening the training time for DRL agents, making them more feasible for application in operational stormwater basin control.

For the designed rainfall, the DRL method has the smallest overall area, signifying its superior performance compared to the other three control strategies. For the real rainfall event, although the DRL method's overall area is comparable to that of the TSS approach, it still has the smallest area, suggesting its efficacy in striking a balance among multiple control objectives. Figures 5 and 7 illustrate that the DRL control strategy adopts a straightforward approach, maintaining the valve opening at 10% throughout. Despite this simplicity, the DRL strategy outperforms the others in terms of pollution reduction and the other four criteria. However, it is important to note that the DRL strategy did not fully exploit the storage capacity of the basin for pollutant removal. By keeping the valve closed after the rainfall until the TSS had fully settled, and only then opening it, the TSS load could have been reduced further without compromising the other performance metrics. Thus, although the DRL strategy yields satisfactory operational results, it is evidently not the optimal solution, and there is still room for improvement that requires further research.

Despite these limitations, the DRL control strategy demonstrated clear advantages over traditional control methods, particularly in balancing conflicting objectives. This provides strong preliminary evidence that RL can be effectively applied to the multi-objective control of stormwater basins.

In this research, we converted the multi-objective optimization problem into a single-objective problem through a weighted sum method. We chose this approach for its simplicity and computational efficiency within our current framework. The main drawback of the weighted sum method is that it presupposes the decision-maker's preferences, which may change with different events or decision-makers. A true multi-objective DRL method, such as Pareto-based DRL techniques, would allow for more flexibility by producing a Pareto-optimal set from which decision-makers can select based on their preferences at the time (Mossalam et al. 2016; Li et al. 2021).

In the training process, the agent explores in a way that maximizes the reward, but the highest-reward behavior may not be the best operating strategy if the reward function does not accurately represent the quantitative objectives. Thus, when determining the reward function, careful thought must be given to how each reward component, such as overflow avoidance, pollutant removal, and smooth discharge, is quantified so that it precisely reflects the agent's performance on the task. The design of the reward function should therefore provide clear guidance, enabling the agent to learn the correct behavioral policy efficiently and effectively.

This study simplifies the multi-objective problem into a single-objective one using a weighted sum method. Future research should explore true multi-objective DRL techniques, such as Pareto-based methods, which allow for more flexible and customizable decision-making.

Additionally, while the current research focuses on a single basin, the coordinated control of multiple basins within a watershed remains an open challenge. Extending DRL to multi-basin systems presents new complexities but offers significant potential for more efficient and effective stormwater management (Xu et al. 2022; Zhang et al. 2023).

Lastly, this study is conducted within a simulation environment. Addressing the reality gap is crucial for the application of the trained agent in practical engineering scenarios (Zhao et al. 2020; Rupprecht & Wang 2022). Currently, there are no documented cases of successfully deploying trained DRL agents in real-world stormwater control applications, making this a highly promising avenue for future research. The successful deployment of DRL in real-world stormwater systems could significantly enhance the resilience and efficiency of urban water management practices.

This research presents a novel approach for controlling a stormwater basin with multiple objectives using DRL. Our method involves training a DQN agent to autonomously make control decisions for the basin outlet within a simulation environment. The performance results demonstrate the superiority of our DRL approach over traditional control strategies in terms of reducing peak flow, preventing overflow with minimal control effort, reducing TSS load, and achieving a better balance between these competing objectives. In addition, one key finding is that an increase in training reward does not necessarily lead to improved performance beyond the early stages of training. This observation suggests that RL agents can achieve optimal behavior relatively early, reducing the time required for training, a critical factor in real-world applications. The findings also underscore the potential of DRL to adapt to varying rainfall scenarios and provide real-time, efficient control solutions for stormwater management.

The results of this research lay a solid groundwork for future endeavors in collaborative control across multiple basins with multiple control objectives. DRL emerges as a promising tool for managing complex collaborative control scenarios in stormwater management, thereby fostering the advancement of intelligent water management practices in urban settings.

This study is financially supported by the National Natural Science Foundation of Jiangsu Province (BK20230099), the National Natural Science Foundation of China (52379061), the fund of National Key Laboratory of Water Disaster Prevention (5240152C2) and the Key Laboratory of Urban Water Supply, Water Saving and Water Environment Governance in the Yangtze River Delta of Ministry of Water Resources (grant no. 2020-3-3).

The authors declare there is no conflict.

Alam, M. Z., Anwar, A. H. M. F., Heitz, A. & Sarker, D. C. (2018) Improving stormwater quality at source using catch basin inserts, Journal of Environmental Management, 228, 393-404. doi:10.1016/j.jenvman.2018.08.070.
Avellaneda, P. M., Jefferson, A. J., Grieser, J. M. & Bush, S. A. (n.d.) Simulation of the cumulative hydrological response to green infrastructure, Water Resources Research, 53 (4), 3087-3101. doi:10.1002/2016WR019836.
Bartos, M., Wong, B. & Kerkez, B. (2018) Open storm: A complete framework for sensing and control of urban watersheds, Environmental Science: Water Research & Technology, 4 (3), 346-358. doi:10.1039/C7EW00374A.
Borchardt, D. & Sperling, F. (1997) Urban stormwater discharges: Ecological effects on receiving waters and consequences for technical measures, Water Science and Technology, 36, 173-178. doi:10.1016/S0273-1223(97)00602-1.
Brombach, H., Weiss, G. & Fuchs, S. (2005) A new database on urban runoff pollution: Comparison of separate and combined sewer systems, Water Science and Technology, 51, 119-128. doi:10.2166/wst.2005.0039.
Dolcetta, I. C. & Ishii, H. (1984) Approximate solutions of the Bellman equation of deterministic control theory, Applied Mathematics & Optimization, 11, 161-181. doi:10.1007/BF01442176.
Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R. Ruiz, F. J., Schrittwieser, J., Swirszcz, G., Silver, D., Hassabis, D. & Kohli, P. (2022) Discovering faster matrix multiplication algorithms with reinforcement learning, Nature, 610, 47-53. doi:10.1038/s41586-022-05172-4.
Fayaz, S. A., Sidiq, S. J., Zaman, M. & Butt, M. A. (2021) Machine learning: An introduction to reinforcement learning. In: Agrawal, P., Gupta, C., Sharma, A., Madaan, V. & Joshi, N. (eds.) Machine Learning and Data Science: Fundamentals and Applications. Hoboken, NJ: John Wiley & Sons, pp. 1-22.
François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G. & Pineau, J. (2018) An introduction to deep reinforcement learning, Foundations and Trends in Machine Learning, 11, 219-354. doi:10.1561/2200000071.
Gaborit, E., Muschalla, D., Vallet, B., Vanrolleghem, P. A. & Anctil, F. (2013) Improving the performance of stormwater detention basins by real-time control using rainfall forecasts, Urban Water Journal, 10, 230-246. doi:10.1080/1573062X.2012.726229.
Griffiths, J. A. (2017) Sustainable urban drainage. In: Abraham, M. A. (ed.) Encyclopedia of Sustainable Technologies. Oxford: Elsevier, pp. 403-413. doi:10.1016/B978-0-12-409548-9.10203-9.
Haan, C. T., Barfield, B. J. & Hayes, J. C. (1994) 9 - Sediment control structures. In: Haan, C. T., Barfield, B. J. & Hayes, J. C. (eds.) Design Hydrology and Sedimentology for Small Catchments. San Diego: Academic Press, pp. 311-390. doi:10.1016/B978-0-08-057164-5.50013-X.
Hansen, S. (2016) Using Deep Q-Learning to Control Optimization Hyperparameters. doi:10.48550/arXiv.1602.04062.
Lacour, C., Joannis, C., Schuetze, M. & Chebbo, G. (2011) Efficiency of a turbidity-based, real-time control strategy applied to a retention tank: A simulation study, Water Science and Technology, 64, 1533-1539. doi:10.2166/wst.2011.545.
Li, K., Zhang, T. & Wang, R. (2021) Deep reinforcement learning for multiobjective optimization, IEEE Transactions on Cybernetics, 51, 3103-3114. doi:10.1109/TCYB.2020.2977661.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. & Wierstra, D. (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Middleton, J. R. & Barrett, M. E. (2008) Water quality performance of a batch-type stormwater detention basin, Water Environment Research, 80, 172-178. doi:10.2175/106143007X220842.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. (2015) Human-level control through deep reinforcement learning, Nature, 518, 529-533. doi:10.1038/nature14236.
Mossalam, H., Assael, Y. M., Roijers, D. M. & Whiteson, S. (2016) Multi-Objective Deep Reinforcement Learning. doi:10.48550/arXiv.1610.02707.
Mullapudi, A., Bartos, M., Wong, B. & Kerkez, B. (2018) Shaping streamflow using a real-time stormwater control network, Sensors (Basel, Switzerland), 18, 2259. doi:10.3390/s18072259.
Mullapudi, A., Lewis, M. J., Gruden, C. L. & Kerkez, B. (2020) Deep reinforcement learning for the real time control of stormwater systems, Advances in Water Resources, 140, 103600. doi:10.1016/j.advwatres.2020.103600.
Park, D., Jang, S. & Roesner, L. A. (2014) Evaluation of multi-use stormwater detention basins for improved urban watershed management, Hydrological Processes, 28, 1104-1113. doi:10.1002/hyp.9658.
Rupprecht, T. & Wang, Y. (2022) A survey for deep reinforcement learning in markovian cyber-physical systems: Common problems and solutions, Neural Networks, 153, 13-36. doi:10.1016/j.neunet.2022.05.013.
Schmitt, Z. K., Hodges, C. C. & Dymond, R. L. (2020) Simulation and assessment of long-term stormwater basin performance under real-time control retrofits, Urban Water Journal, 17, 467-480. doi:10.1080/1573062X.2020.1764062.
Shishegar, S., Duchesne, S. & Pelletier, G. (2019) An integrated optimization and rule-based approach for predictive real time control of urban stormwater management systems, Journal of Hydrology (Amsterdam), 577, 124000. doi:10.1016/j.jhydrol.2019.124000.
Shishegar, S., Duchesne, S., Pelletier, G. & Ghorbani, R. (2021) A smart predictive framework for system-level stormwater management optimization, Journal of Environmental Management, 278, 111505. doi:10.1016/j.jenvman.2020.111505.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T. & Hassabis, D. (2017) Mastering the game of Go without human knowledge, Nature, 550, 354-359. doi:10.1038/nature24270.
Suleiman, L., Olofsson, B., Saurí, D., Palau-Rof, L., García Soler, N., Papasozomenou, O. & Moss, T. (2020) Diverse pathways-common phenomena: Comparing transitions of urban rainwater harvesting systems in Stockholm, Berlin and Barcelona, Journal of Environmental Planning and Management, 63, 369-388. doi:10.1080/09640568.2019.1589432.
Sutton, R. S. & Barto, A. G. (2018) Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Tian, W., Liao, Z., Zhi, G., Zhang, Z. & Wang, X. (2022) Combined sewer overflow and flooding mitigation through a reliable real-time control based on multi-reinforcement learning and model predictive control, Water Resources Research, 58 (7), e2021WR030703. doi:10.1029/2021WR030703.
Tian, W., Xin, K., Zhang, Z., Zhao, M., Liao, Z. & Tao, T. (2023b) Flooding mitigation through safe & trustworthy reinforcement learning, Journal of Hydrology, 620, 129435. doi:10.1016/j.jhydrol.2023.129435.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C. & Silver, D. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature, 575, 350-354. doi:10.1038/s41586-019-1724-z.
Wang, Z. & Hong, T. (2020) Reinforcement learning for building controls: The opportunities and challenges, Applied Energy, 269, 115036. doi:10.1016/j.apenergy.2020.115036.
Wang, J., Elfwing, S. & Uchibe, E. (2021) Modular deep reinforcement learning from reward and punishment for robot navigation, Neural Networks, 135, 115-126. doi:10.1016/j.neunet.2020.12.001.
Xu, W. D., Burns, M. J., Cherqui, F., Smith-Miles, K. & Fletcher, T. D. (2022) Coordinated control can deliver synergies across multiple rainwater storages, Water Resources Research, 58 (2), e2021WR030266. doi:10.1029/2021WR030266.
Zhang, M., Xu, Z., Wang, Y., Zeng, S. & Dong, X. (2022) Evaluation of uncertain signals' impact on deep reinforcement learning-based real-time control strategy of urban drainage systems, Journal of Environmental Management, 324, 116448. doi:10.1016/j.jenvman.2022.116448.
Zhao, W., Queralta, J. P. & Westerlund, T. (2020) Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey, 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 737-744. doi:10.1109/SSCI47803.2020.9308468.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).