ABSTRACT
Stormwater basins are important stormwater control measures that can reduce peak flow rates, mitigate flood volumes, and improve water quality during heavy rainfall events. Previous control strategies for stormwater basins have typically treated water quality and water quantity as separate objectives. With urban runoff increasing under climate change and urbanization, current single-objective control strategies cannot fully harness the control potential of basins and therefore require improvement. However, designing multi-objective control strategies for basins is challenging because of conflicting operational goals and the complexity of dynamic environmental conditions. This research proposes a novel real-time control strategy based on deep reinforcement learning to address these challenges. It employs a deep Q-network to develop an agent capable of making control decisions. After being trained on three different rainfall events, the reinforcement learning agent can make appropriate decisions for previously unseen rainfall events. Compared with two rule-based control scenarios and a static scenario, the deep reinforcement learning method is more effective at reducing total suspended solids, reducing peak flow, minimizing outflow flashiness, and limiting control effort, striking a good balance between conflicting control objectives.
HIGHLIGHTS
Deep reinforcement learning is effective in stormwater basin control with multiple objectives.
There is no direct correlation between training reward and performance.
A well-trained agent can adjust to unseen rainfall events.
INTRODUCTION
A stormwater basin, or stormwater detention pond, is a man-made facility designed to capture and manage the excess water generated during heavy rainfall. As surface storage facilities, they serve as a crucial tool for controlling stormwater runoff through attenuation and purification. They temporarily hold stormwater during rainfall, allowing water to infiltrate into the ground slowly or be released back into the environment at a controlled rate, subsequently reducing peak flows and flood volumes (Griffiths 2017). Additionally, these basins promote the deposition of particulate pollutants, thereby improving stormwater quality (Haan et al. 1994; Alam et al. 2018). Consequently, they represent a viable solution to the multitude of challenges posed by urban stormwater management. Typically, real-time control (RTC) devices are integrated at the outlets of basins, enabling them to adjust the water discharge rate in response to external conditions, thereby significantly enhancing the control effectiveness.
Although basins offer multiple control capabilities, their operation has usually relied on single-objective strategies. Some focus solely on water quantity, aiming to minimize overflow, mitigate peak flows, or decrease downstream flow velocities (Park et al. 2014; Schmitt et al. 2020). Others concentrate on water quality, targeting reductions in total suspended solids (TSS) or other pollutants (Middleton & Barrett 2008; Lacour et al. 2011; Gaborit et al. 2013). Some of these control objectives conflict (for instance, lowering pollutant concentrations requires closing the valve so that solids can settle, which raises water levels and increases overflow in the basin), while others can be reconciled (e.g., reducing peak flow often enhances pollutant removal efficiency). Consequently, in most cases, a single-objective control approach is not optimal for overall control effectiveness and does not fully exploit the basin's potential.
In light of climate change, extreme precipitation events are becoming more frequent. The escalating urbanization drives an expansion of impermeable surfaces across urban landscapes. As a result of the combined influence of climate change and urbanization, the magnitude of urban stormwater runoff is increasing. This burgeoning runoff not only releases substantial volumes of pollutants into waterways (Brombach et al. 2005), but also stands as a primary culprit behind urban flooding, erosion of water bodies, rapid peak flows, and hydraulic disturbances in receiving streams (Borchardt & Sperling 1997). Given these mounting challenges, basins, serving as essential elements in urban rainwater management, need to adapt by shifting from single-objective control to multi-objective control to effectively manage the escalating urban runoff pressures.
Moving from single-objective to multi-objective control demands more sophisticated control strategies. Rule-based control strategies (e.g., if the water level exceeds a certain threshold, set the outlet to a certain opening) are commonly applied to RTC basins with a single objective. These rules are designed from human experience; they are stable and easy to implement, but they have several disadvantages. First, rule-based control algorithms are not dynamic enough to adapt to various rainfall events. Second, the control rules are generalized from human experience, which provides no guarantee of optimality. Third, it is very difficult to design suitable rules for basins with competing control objectives, which is common in the real world (Zhang et al. 2023). Thus, the rule-based control method is not suitable for multi-objective control of basins.
To address these limitations, this study explores the application of deep reinforcement learning (DRL) as a novel control strategy. DRL is a category of machine learning that aims to learn from trial-and-error experience through interactive learning with the environment (Fayaz et al. 2021). With DRL, the control strategies can be learnt, instead of using predetermined rules (Sutton & Barto 2018). DRL can be combined with neural networks as function approximators which increases the flexibility to optimize control actions and has the potential to continually adapt operations to evolving environmental conditions (Lillicrap et al. 2015). With the advances of deep learning, DRL has successfully automated a variety of complex and human-level tasks, such as the game of Go (Silver et al. 2017), StarCraft II (Vinyals et al. 2019), matrix multiplication (Fawzi et al. 2022), optimal path finding for multi-arm manipulators (Suleiman et al. 2020), combined sewer control (Zhang et al. 2023), and building energy control (Wang & Hong 2020).
Researchers in the stormwater management field have also applied DRL to generate control strategies for urban drainage systems and achieved desirable outcomes (Mullapudi et al. 2018, 2020; Tian et al. 2022, 2023a, 2023b; Zhang et al. 2022). However, their control objectives were relatively simple, considering only flood mitigation and peak flow reduction, without accounting for water quality or control effort. Because these objectives are largely compatible (flow velocity typically scales with water depth), the resulting multi-objective problem remains manageable. Nevertheless, as urban stormwater runoff escalates, parameters such as pollution reduction effectiveness and the mitigation of outflow flashiness can no longer be overlooked. The increase in control objectives introduces a new level of complexity, leaving the suitability of reinforcement learning (RL) for such intricate tasks an open question that warrants further investigation. Moreover, compared with related engineering fields such as building heating, ventilation, and air conditioning control and electric grid control, the potential of DRL for stormwater basin control remains poorly explored. It is unclear how DRL will perform for the multi-objective control of stormwater basins and how its performance can be improved effectively.
This study applies DRL to generate multi-objective control strategies for a stormwater basin in a real-world watershed, aiming to maximize pollutant removal efficiency, overflow mitigation, and peak flow reduction while minimizing the number of control actions and outflow flashiness. The agent is trained on 12-h design storms with 2-, 5-, and 20-year return periods. It is then tested on a 12-h 10-year return period design storm and a recorded real rainfall event.
METHODS
This section delves into the development, execution, and assessment of a DRL control strategy for managing multiple objectives in stormwater basins. The approach begins with the creation of a physics-based model to simulate the hydrodynamics and pollutant removal of the basin using the storm water management model (SWMM). Then, a DRL agent that can interact with SWMM is created in Python (a minimal sketch of this coupling is given below). The DRL algorithm is subsequently trained on synthetic rainfall data, ensuring its adaptability to diverse scenarios. The trained algorithm is then subjected to rigorous testing with both newly designed and real-world rainfall events. Comparative evaluations are conducted against three alternative control methods: water level-based control, pollution-focused control, and a static strategy, to demonstrate the efficacy of the DRL approach.
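The sketch below illustrates the simulation-control loop between the Python agent and SWMM. It assumes the pyswmm package and illustrative model object names ("basin", "orifice", a 5-min control interval, and a fixed 50% placeholder action); the study does not prescribe this specific interface, so treat it as one possible realization.

```python
# Minimal sketch of the SWMM-Python interaction loop (assumptions: pyswmm,
# illustrative node/link names, 5-min control interval, placeholder policy).
from pyswmm import Simulation, Nodes, Links

with Simulation("basin_model.inp") as sim:
    basin = Nodes(sim)["basin"]        # storage node representing the basin
    orifice = Links(sim)["orifice"]    # controllable outlet orifice
    sim.step_advance(300)              # one control decision every 5 min (assumed)

    last_action = 1.0                  # orifice starts fully open (assumed)
    for _ in sim:
        # Observe the state variables SWMM exposes at this time step
        state = (basin.depth, basin.total_inflow,
                 basin.total_outflow, last_action)
        # Placeholder policy; the trained DRL agent replaces this call
        action = 0.5                   # 50% opening, for illustration only
        orifice.target_setting = action
        last_action = action
```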
Study area and rainfall events
Control objectives and evaluation metrics
This area has traditionally served as a major focal point in the city's strategy to combat flooding and reduce runoff-driven water quality impairments (Bartos et al. 2018). With water flow issues rising, enhancing multi-objective stormwater management in the area is crucial for sustainable water use and environmental protection. In this study, the control objectives of the basin are:
(1) Maximize the detention time of the captured water to improve pollutant removal.
(2) Minimize overflow volume to mitigate flooding.
(3) Minimize peak outflow to reduce erosion in downstream riverbeds.
(4) Discharge the basin in a smooth way to minimize the hydraulic shocks induced to receiving water bodies and to minimize re-suspension.
(5) Minimize the frequency of orifice operations in order to prevent excessive wear and tear on the actuator.
The corresponding performance evaluation metrics are (a) overflow volume, (b) peak flow rate, (c) downstream cumulative TSS load, (d) control effort, and (e) outflow flashiness. These metrics are defined mathematically in Table 1, and a computational sketch follows the table.
| Performance criteria | Quantitative performance measure |
|---|---|
| Overflow | $\sum_t Q_{\text{over}}(t)\,\Delta t$ |
| Peak outflow | $\max_t Q_{\text{out}}(t)$ |
| Cumulative TSS load | $\sum_t Q_{\text{out}}(t)\,C_{\text{TSS}}(t)\,\Delta t$ |
| Control effort | $\sum_t \mathbf{1}\!\left[a_t \neq a_{t-1}\right]$ |
| Outflow flashiness | $\sum_t \lvert Q_{\text{out}}(t) - Q_{\text{out}}(t-1) \rvert$ |
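A minimal sketch of how these metrics can be computed from simulated time series, assuming the standard definitions above (overflow volume and TSS load integrated over the event, control effort counted as the number of orifice setting changes). The variable names are illustrative.

```python
import numpy as np

def evaluation_metrics(q_out, q_over, tss_out, settings, dt):
    """Compute the Table 1 metrics from simulated time series (sketch).

    q_out    : outflow rate at each step [m^3/s]
    q_over   : overflow rate at each step [m^3/s]
    tss_out  : TSS concentration of the outflow [mg/L] (1 mg/L = 1 g/m^3)
    settings : orifice opening fraction at each step [-]
    dt       : simulation time step [s]
    """
    q_out, q_over = np.asarray(q_out), np.asarray(q_over)
    tss_out, settings = np.asarray(tss_out), np.asarray(settings)

    overflow_volume = np.sum(q_over) * dt                    # (a) m^3
    peak_outflow = np.max(q_out)                              # (b) m^3/s
    tss_load = np.sum(q_out * tss_out) * dt                   # (c) grams
    control_effort = np.sum(np.abs(np.diff(settings)) > 0)    # (d) setting changes
    flashiness = np.sum(np.abs(np.diff(q_out)))               # (e) sum of outflow changes
    return overflow_volume, peak_outflow, tss_load, control_effort, flashiness
```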
Formulation as reinforcement learning problem
The general idea of DRL control is to train an agent through interacting with an environment. By taking certain actions and receiving rewards or penalties in return, the agent learns how to behave better in a certain environment (Wang et al. 2021). The learning process is as follows: firstly, the agent initializes its knowledge (often randomly). It then interacts with the environment, observes the outcomes, and updates its knowledge based on the rewards it receives. The agent uses this knowledge to improve its policy over time. There are several important elements of RL, including agent, environment, states, actions, rewards, policy, value functions, exploration, and exploitation.
The controlled outflow through the orifice follows the orifice equation, $Q = \alpha\, C_d\, A \sqrt{2 g \Delta h}$, where $\alpha$ is the opening percentage of the orifice; $C_d$ is the orifice discharge coefficient; $A$ is the cross-sectional area of the orifice; $g$ is the gravitational acceleration; and $\Delta h$ is the height difference between the orifice and the water surface.
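A minimal sketch of this relation; the parameter values in the usage example are illustrative, not those of the study basin.

```python
import math

def orifice_discharge(opening, cd, area, delta_h, g=9.81):
    """Orifice outflow (m^3/s): Q = opening * Cd * A * sqrt(2 * g * dh)."""
    return opening * cd * area * math.sqrt(2.0 * g * max(delta_h, 0.0))

# Example: a half-open 0.5 m^2 orifice with Cd = 0.65 under 2 m of head
q = orifice_discharge(opening=0.5, cd=0.65, area=0.5, delta_h=2.0)
```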
If the basin's water level decreases, a positive reward is granted; conversely, negative rewards are assigned when the water level rises, when the orifice setting is changed, and when pollution levels increase. All of these parameter settings were determined through multiple trials, and the current parameters allow the agent to converge during training and learn effective control strategies.
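A minimal sketch of a weighted-sum reward of this form; the weights and the exact penalty terms are illustrative assumptions rather than the values tuned in the study.

```python
def reward(depth_change, action_changed, tss_change, weights=(1.0, 0.1, 1.0)):
    """Weighted-sum reward sketch: reward falling water levels, penalize
    rising levels, control actions, and rising TSS (weights are illustrative)."""
    w_depth, w_action, w_tss = weights
    r = -w_depth * depth_change            # depth falls -> positive reward
    r -= w_action * float(action_changed)  # penalize each orifice adjustment
    r -= w_tss * max(tss_change, 0.0)      # penalize rising pollution
    return r
```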
A policy is a mapping from an agent's state or a sequence of states to an action. It represents the agent's decision-making strategy. The policy determines what action the agent should take in a given situation to maximize its long-term reward. A value function estimates the long-term reward an agent can expect to receive by following a particular policy. Exploration is the process by which an agent learns about the environment by taking actions that may lead to uncertain or unexplored outcomes. The goal of exploration is to gather information about the environment to improve the agent's policy. Exploration strategies often involve a balance between choosing actions based on the current knowledge (exploitation) and taking random or seemingly suboptimal actions to discover new information. Exploitation is the act of choosing the action that appears to be the best, based on the current knowledge and the agent's current policy. It is the process of maximizing the immediate reward by taking the action that is believed to yield the highest reward given the current state (François-Lavet et al. 2018; Sutton & Barto 2018; Fayaz et al. 2021). However, exploitation can lead to local optima if the agent does not explore enough.
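A minimal sketch of the ε-greedy rule that implements this exploration-exploitation trade-off; `q_values` is the agent's current action-value estimate for a given state.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (exploration);
    otherwise take the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore
    return int(np.argmax(q_values))              # exploit
```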
Hyperparameters setting
In RL, hyperparameters are parameters that are not learned during the training process but are set by the researcher or developer before the learning algorithm begins (François-Lavet et al. 2018). These values influence the behavior and performance of the learning algorithm and tuning them is crucial for achieving optimal results.
The hyperparameters for the DQN agent are shown in Table 2 (Mnih et al. 2015). These values were determined through a combination of empirical testing and references from related literature. Regarding the network structure, 6 is the number of input nodes, corresponding to the six input features (i.e., the current depth, current inflow, current outflow, current TSS concentration in the basin, current inflow TSS concentration, and current opening percentage of the controllable orifice) used by the agent. The hidden layer consists of 20 neurons, chosen to give the model enough complexity to capture the underlying patterns in the data without overfitting. Finally, 11 is the number of output nodes, corresponding to the 11 possible actions (i.e., 0, 10, 20, …, 100% opening) the agent can take. This structure was selected after testing different configurations to ensure that the model has enough capacity to learn effectively while remaining efficient in training and decision-making. The learning rate of 0.001 means that the agent integrates only 0.1% of new information into each update; this rate was chosen to favor slow, stable learning and to avoid large updates that could destabilize the agent. The discount rate γ of 0.9 enables the agent to consider future rewards while maintaining a focus on immediate outcomes; this value was optimized through trials to account for long-term rewards without overwhelming short-term learning. The ε-greedy rate of 0.1 means the agent explores new actions 10% of the time and exploits known actions 90% of the time, balancing learning from new experiences against leveraging existing knowledge. The replay memory size and batch size were set based on tests aimed at allowing the agent to recall sufficient past experiences while maintaining training efficiency. The agent updates its network every five time steps, a value chosen to reduce the computational load and shorten training time.
| Hyperparameter | Value |
|---|---|
| Network structure | 6 × 20 × 11 |
| Learning rate | 0.001 |
| Discount rate (γ) | 0.9 |
| ε-greedy rate (ε) | 0.1 |
| Replay memory size (N) | 1,000 |
| Batch size | 64 |
| Learning interval | Every 5 time steps |
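A minimal sketch of the 6 × 20 × 11 Q-network in Table 2, assuming PyTorch and a ReLU activation; the framework, activation function, and optimizer are not specified in the text and are assumptions here.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """6 x 20 x 11 fully connected Q-network (structure from Table 2)."""
    def __init__(self, n_states=6, n_hidden=20, n_actions=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, n_hidden),   # 6 state features -> 20 hidden units
            nn.ReLU(),                        # activation assumed
            nn.Linear(n_hidden, n_actions),   # one Q-value per orifice opening
        )

    def forward(self, state):
        return self.net(state)

policy_net = QNetwork()
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)  # learning rate from Table 2
```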
In this study, a DQN is chosen, in which a deep neural network serves as the function approximator for the value function. The network takes the state of the environment as input and processes it through multiple layers, allowing it to capture complex patterns and non-linear relationships between states and actions. This deep neural network can generalize across different states and provide an estimate of the optimal Q-value for each possible action. During the training phase, the approximator learns to minimize the difference between its predicted Q-values and the target Q-values, which are calculated from the agent's current policy and the Bellman equation. This process of adjusting the network's weights through backpropagation enables the approximator to continually refine its estimates and improve the agent's decision-making in complex, sequential decision-making problems.
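A minimal sketch of one such update, assuming the PyTorch network above, a separate target network, and a replay buffer that returns batched tensors; these implementation details are assumptions rather than the study's exact code.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.9  # discount rate from Table 2

def dqn_update(policy_net, target_net, optimizer, batch):
    """One DQN learning step: regress predicted Q-values toward the Bellman target."""
    states, actions, rewards, next_states = batch  # batched tensors (assumed)

    # Q(s, a) predicted by the online network for the actions actually taken
    q_pred = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + GAMMA * q_next

    loss = F.mse_loss(q_pred, q_target)  # minimize prediction error
    optimizer.zero_grad()
    loss.backward()                       # backpropagation
    optimizer.step()
    return loss.item()
```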
Control scenarios
This study sets four control scenarios to compare control performance. The first is a static scenario in which the outlet is kept fully open at all times. The second is water level-based control, which keeps the outlet closed until the water level in the basin exceeds a certain point: if the water level exceeds 2 m, the valve fully opens; if the water level is between 1 and 2 m and there is no inflow, the valve is half-opened; and if the water level is below 1 m and there is no inflow, the valve is closed. The third is pollution concentration-based control: if the TSS concentration exceeds 15 mg/L and the water level exceeds 2.5 m, the valve closes; otherwise, it fully opens. The last is the DRL control scenario, which uses the trained agent to control the outlet orifice. These scenarios are tested on a 10-year return period design rainfall event as well as a recorded real rainfall event.
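A minimal sketch of the three benchmark rules, using the thresholds stated above; the returned values are orifice opening fractions, and the handling of cases not covered by the text (e.g., water below 2 m with inflow under level-based control) is an assumption.

```python
def static_control(*_):
    """Static scenario: outlet fully open at all times."""
    return 1.0

def water_level_control(depth, inflow):
    """Water level-based rule (depth in m, inflow in m^3/s)."""
    if depth > 2.0:
        return 1.0          # fully open above 2 m
    if depth > 1.0 and inflow <= 0.0:
        return 0.5          # half open between 1 and 2 m with no inflow
    if inflow <= 0.0:
        return 0.0          # closed below 1 m with no inflow
    return 0.0              # otherwise keep closed (assumed from the text)

def tss_control(tss, depth):
    """TSS concentration-based rule (tss in mg/L, depth in m), as stated above."""
    if tss > 15.0 and depth > 2.5:
        return 0.0          # close the valve
    return 1.0              # otherwise fully open
```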
The selection of these three control scenarios is based on their prevalence as common control strategies, supported by numerous prior research studies (Gaborit et al. 2013; Shishegar et al. 2021, 2019). By comparing these scenarios, the performance of RL control can be clearly demonstrated, highlighting both its strengths and limitations. However, it is important to note that other advanced rule-based control strategies may exist that could potentially outperform those selected for this comparative analysis.
RESULTS
Once the simulation model and DRL agent were established and integrated, training and testing were carried out, followed by a performance comparison of the four control scenarios described above.
Training result
Test results
In the 'outflow' subplots of Figures 5 and 7, the TSS-based control method generates the highest peak flow due to its operational principle, which involves closing the valve until the TSS concentration drops below the threshold. This leads to elevated basin water levels, resulting in a significant peak flow upon valve re-opening. However, it can be inferred that a more gradual valve-opening strategy once TSS reaches the threshold would help to manage peak flow more effectively. As Figure 8 illustrates, the TSS control strategy performs well on several metrics, such as control effort and overflow volume, with its overall area ranked second only to the RL approach.
DISCUSSION
In this research, we selected 100 training episodes for the DRL agent. To test the robustness of the agent's learning process, we also evaluated control performance after 20 and 50 training episodes. Interestingly, the control results after 20, 50, and 100 episodes were identical, suggesting that the critical behavioral patterns were learned relatively early in the training process. This observation indicates that although the total reward continues to increase with more episodes, the agent's performance stabilizes after a certain point. This is a significant finding, as training time is a key factor limiting the practical implementation of DRL in real-world projects. By reducing the number of required training episodes, this research presents the possibility of significantly shortening the training time for DRL agents, making them more feasible for application in operational stormwater basin control.
For the designed rainfall, the DRL method has the smallest overall area, signifying superior performance compared with the other three control strategies. For the real rainfall event, although the DRL method's overall area was comparable to that of the TSS-based approach, it was still the smallest, suggesting its efficacy in striking a balance among multiple control objectives. Figures 6 and 8 show that the DRL control strategy is straightforward, maintaining the valve opening at 10% throughout. Despite this simplicity, the DRL strategy outperforms the others in terms of pollution reduction and the four other criteria. However, the DRL strategy did not fully exploit the storage capacity of the basin for pollutant removal: by keeping the valve closed after the rainfall until the TSS had fully settled, and only then opening it, the TSS load could be reduced further without compromising the other performance metrics. Thus, although the DRL strategy yields satisfactory operational results, it is evidently not the optimal solution, and there is still room for improvement that warrants further research.
Despite these limitations, the DRL control strategy demonstrated clear advantages over traditional control methods, particularly in balancing conflicting objectives. This provides strong preliminary evidence that RL can be effectively applied to the multi-objective control of stormwater basins.
In this research, we converted the multi-objective optimization problem into a single-objective problem through a weighted sum method. We chose this approach for its simplicity and computational efficiency within our current framework. The main drawback of the weighted sum method is that it presupposes the decision-maker's preferences, which may change with different events or decision-makers. A true multi-objective DRL method, such as a Pareto-based DRL technique, would allow for more flexibility by producing a Pareto-optimal set from which decision-makers can select based on their preferences at the time (Mossalam et al. 2016; Li et al. 2021).
FUTURE DIRECTION
During training, the agent explores in a way that maximizes its reward, but the reward-maximizing behavior may not be the best operational behavior if the reward function does not represent the quantitative objectives accurately. Thus, when determining the reward function, careful thought should be given to how each reward component (e.g., rewards for detention and penalties for overflow, control actions, and abrupt outflow changes) is quantified so that it precisely reflects the agent's performance on the task. The design of the reward function should therefore provide clear guidance, enabling the agent to learn the correct behavioral policy efficiently and effectively.
This study simplifies the multi-objective problem into a single-objective one using a weighted sum method. Future research should explore true multi-objective DRL techniques, such as Pareto-based methods, which allow for more flexible and customizable decision-making.
Additionally, while the current research focuses on a single basin, the coordinated control of multiple basins within a watershed remains an open challenge. Extending DRL to multi-basin systems presents new complexities but offers significant potential for more efficient and effective stormwater management (Xu et al. 2022; Zhang et al. 2023).
Lastly, this study is conducted within a simulation environment. Addressing the reality gap is crucial for applying the trained agent in practical engineering scenarios (Zhao et al. 2020; Rupprecht & Wang 2022). To our knowledge, there are no documented cases of successfully deploying trained DRL agents in real-world stormwater control applications, making this a highly promising avenue for future research. The successful deployment of DRL in real-world stormwater systems could significantly enhance the resilience and efficiency of urban water management practices.
CONCLUSIONS
This research presents a novel approach for controlling a stormwater basin with multiple objectives using DRL. Our method involves training a DQN agent to autonomously make control decisions for the basin outlet within a simulation environment. The performance results demonstrate the superiority of our DRL approach over traditional control strategies in terms of reducing peak flow, preventing overflow with minimal control effort, reducing TSS load, and achieving a better balance between these competing objectives. Moreover, one key finding is that an increase in training reward does not necessarily lead to improved performance once the critical behavioral patterns have been learned early in training. This observation suggests that RL agents can achieve effective behavior relatively early, reducing the time required for training, which is a critical factor in real-world applications. The findings also underscore the potential of DRL to adapt to varying rainfall scenarios and provide real-time, efficient control solutions for stormwater management.
The result of this research lays a solid groundwork for future endeavors in collaborative control across multiple basins with multiple control objectives. DRL emerges as a promising tool for managing complex collaborative control scenarios in stormwater management, thereby fostering the advancement of intelligent water management practices in urban settings.
ACKNOWLEDGEMENTS
This study is financially supported by the National Natural Science Foundation of Jiangsu Province (BK20230099), the National Natural Science Foundation of China (52379061), the fund of National Key Laboratory of Water Disaster Prevention (5240152C2) and the Key Laboratory of Urban Water Supply, Water Saving and Water Environment Governance in the Yangtze River Delta of Ministry of Water Resources (grant no. 2020-3-3).
DATA AVAILABILITY STATEMENT
All relevant data are available from https://github.com/Dad-of-xushao/DRL-for-multiple-objectives-control-of-basin.git.
CONFLICT OF INTEREST
The authors declare there is no conflict.