Abstract
Optimal operation of membrane bioreactor (MBR) plants is crucial for saving operational costs while satisfying legal effluent discharge requirements. The aeration process of MBR plants tends to consume excessive energy in supplying air to micro-organisms. In the present study, a novel deep reinforcement learning (DRL)-based optimal aeration system is proposed for dynamic and robust optimization, so as to meet stringent discharge requirements while maximizing the system's energy efficiency, and it is compared with a manual system and conventional reinforcement learning (RL)-based systems. A deep Q-network (DQN) algorithm automatically learns how to operate the plant efficiently by finding an optimal trajectory that reduces the aeration energy without degrading the treated water quality. A full-scale MBR plant with the DQN-based autonomous aeration system can decrease the MBR's aeration energy consumption by 34% compared to the other aeration systems while maintaining the treatment efficiency within the effluent discharge limits.
INTRODUCTION
A membrane bioreactor (MBR) plant consists of an activated sludge wastewater treatment system retrofitted with a membrane reactor. This configuration has been extensively applied to wastewater treatment because it reduces the footprint, produces less sludge, and improves effluent quality (Van den Broeck et al. 2012). However, dynamic and unstable influent patterns hinder the stability of the effluent quality in MBR plants. Therefore, there has been growing interest in the efficient operation of MBRs under stringent environmental regulations. Water pollution is now considered a serious environmental problem; accordingly, effluent regulations have become stricter and advanced operating systems have become important (Åmand et al. 2013).
Aeration is an important factor in the operation of wastewater treatment plants from both environmental and economic viewpoints. The aeration process is closely linked to micro-organism activity and accounts for half of the electricity used in wastewater treatment (Hernández-Del-Olmo et al. 2012; Åmand et al. 2013). Therefore, optimization of the aeration system is vital to managing the aeration process properly. An optimized system can improve the effluent quality based on micro-organism activity while reducing the aeration energy.
Development of an optimal aeration system is necessary for efficient, eco-friendly operation of the process. Moreover, process optimization reduces operational costs by balancing clean-water productivity with energy requirements. Process optimization techniques can improve process efficiency without major changes to physical structures. Thus, process optimization has come to occupy a significant role in wastewater treatment research (Mannina & Cosenza 2013).
Traditionally, optimization relied on mechanistic models of the complex wastewater treatment processes, taking environmental and economic aspects into account. Bournazou et al. (2013) suggested a transformed nonlinear programming optimization technique to optimize the aeration profile of a sequencing batch reactor process; the suggested approach was computationally fast and robust. Faridnasr et al. (2016) optimized the aeration time of a moving-bed biofilm sequencing batch reactor using a kinetic computational model. Fan et al. (2017) identified optimal dissolved oxygen (DO) setpoints in an aerobic reactor using growth kinetics parameters under different air flow rates and mixed liquor volatile suspended solids concentrations. Model-based optimization of the aeration system was thus often developed using empirical mathematical approaches. However, the dynamics of wastewater increase the complexity of the models and the computational run-time of mathematical optimization. Consequently, mathematical techniques have not been extensively applied to global optimization problems.
Taking the operational conditions of wastewater treatment plants into account, the influent is significantly affected by climate change (Vo et al. 2014). Hence, empirical models are not efficient enough to solve such a sophisticated problem with many uncertainties. In contrast, artificial intelligence (AI) algorithms are a robust alternative for optimizing wastewater treatment plants, offering several advantages including flexibility, global optimization, transparency, automation, and continuity (Waschneck et al. 2018). Reinforcement learning (RL) is a powerful AI technique for process optimization, but its application in wastewater treatment research remains rare.
This study aims to develop a novel optimal autonomous aeration system for a full-scale MBR plant based on the RL algorithm. The autonomous aeration system was used to propose an optimal trajectory of DO concentrations in the aerobic reactor of the full-scale MBR plant. The deep reinforcement learning (DRL) algorithms were compared to conventional RL algorithms, and their performances were evaluated in terms of economic and environmental improvements in the plant. The DRL algorithms, namely the deep Q-network (DQN) and deep state-action-reward-state-action (deep-SARSA) algorithms, and the conventional RL algorithms, namely Q-learning and SARSA, were employed to improve the aeration system.
The proposed framework for determining the optimal DO trajectory of the aeration system using the AI algorithm is shown in Figure 1. The proposed approach consists of two parts: (1) selecting the structures of the RL algorithms to be employed in the MBR plant and (2) proposing the optimal DO trajectory in the aeration system considering the influent and operational data of the plant. First, the structure of the RL algorithms was selected based on information obtained from the MBR plant and data provided by the operators. An activated sludge model-soluble microbial product (ASM-SMP) model was calibrated to describe the complex biological and physical interactions between micro-organisms and membranes in the MBR plant (Mannina & Cosenza 2013). Then, the state, action, and reward of the RL structure were defined according to the plant information and operational experience so that the economic benefits and environmental improvements could be assessed. The aeration energy (AE) and effluent quality index (EQI) were selected as the crucial factors for computing the reward value, considering the energy efficiency and environmental stability of the process; these factors were estimated with the ASM-SMP following Alex et al. (2008). Second, the RL algorithms were applied to determine the optimal DO concentration trajectory in the aeration system using the composed structures. Since the influent of the MBR plant had a periodic diurnal pattern, the DO concentration trajectory was proposed daily using the RL algorithms. Therefore, the RL algorithms were employed over 47 time intervals, with the optimal DO value determined every 30 min.
The manual system with a fixed DO concentration was compared to four RL algorithm-based systems to determine the robustness of the proposed autonomous aeration system. The aeration systems were based on the Q-learning, SARSA, deep-SARSA, and DQN algorithms, respectively. Subsequently, the suggested RL algorithms were compared taking the EQI and AE into account. The superior operational trajectory was the one with the highest AE saving potential while maintaining the removal efficiency above the target. This trajectory was applied to two scenarios, (1) a one-day and (2) a one-month operational period, to verify the effectiveness and robustness of the suggested operational trajectory-searching approaches.
METHODS
Reinforcement learning algorithms
RL is an AI-based autonomous decision-making algorithm that solves optimization problems using an agent interacting with an environment while receiving a reward value at each simulation time step. The action is determined by a policy that maps states to actions so as to maximize the cumulative reward, and the RL algorithm searches for the optimal policy. A deep learning (DL) algorithm can be used to approximate this policy, an approach that has produced successful results in the fields of image and language processing. In this study, the RL algorithm, the case-study plant, the aeration setpoints, the influent characteristics, and the required energy and effluent quality represent the agent, environment, action, state, and reward, respectively (Spielberg et al. 2017). This study employed Q-learning, SARSA, deep-SARSA, and DQN to optimize the aeration system of the MBR plant.
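For illustration, the core idea behind Q-learning, one of the conventional RL algorithms used here, can be sketched as a tabular update rule. This is a minimal sketch only; the state/action sizes, learning rate, and discount factor below are assumed example values, not the settings used in the study.

```python
import numpy as np

# Minimal tabular Q-learning sketch (illustrative only): the agent updates its
# action-value table toward the reward plus the discounted best value of the
# next state. n_states, n_actions, alpha, and gamma are assumed values.
n_states, n_actions = 1000, 13
alpha, gamma = 0.1, 0.95
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```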
DQN is a DRL algorithm that overcomes limitations of the conventional RL algorithms. DQN combines Q-learning with a DL technique, referred to as a Q-network, to maximize the rewards obtained from the actions. The approach is trained through interaction between the agent and the environment while an optimal policy is obtained from the historical dataset. The DQN automatically learns an operational policy from high-dimensional inputs of historical observations in a flexible manner. Moreover, an experience replay memory stores the current state, action, reward, and next state (s_t, a_t, r_t, and s_{t+1}). The experience replay memory randomizes the data, removes correlations in the sequential data, and smooths changes in the data distribution; it thus reduces the correlations that disturb the determination of the optimal action in conventional RL algorithms. The stored dataset is used to train the DQN algorithm, reusing periodical and episodic experiences to learn efficiently and avoid overfitting. These structural advantages of the DQN algorithm enhance its capability to learn from new data (Mnih et al. 2015; Spielberg et al. 2017). Details of Q-learning, SARSA, and deep-SARSA are described in Supplementary Material (A).
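A minimal sketch of such an experience replay memory is shown below (Python); the capacity and batch size are assumed values, not those of the study.

```python
import random
from collections import deque

class ReplayMemory:
    """Store (s_t, a_t, r_t, s_{t+1}) transitions and sample random mini-batches,
    which breaks the temporal correlations present in sequentially collected data."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```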
An environment of RL algorithms using an hourly calibrated ASM-SMP
ASM-SMP was employed as the environment of the RL-based optimization algorithm. The combined ASM-SMP is a mature tool for modeling the biological and physical phenomena of the MBR process, especially its operational aspects, in detail (Mannina & Cosenza 2013). ASM efficiently describes the biological phenomena of micro-organisms that remove organic matter and nitrogen compounds (Naessens et al. 2012). The ASM-SMP of the case study had to be calibrated to adjust the kinetic parameters for accurate modeling of micro-organism activity. Therefore, the kinetic parameters were calibrated using both a sensitivity analysis and a genetic algorithm based on hourly measured data from the plant. The hourly calibrated ASM-SMP model was then used as the environment of the RL algorithms, providing state information and reward values according to the action values. The output of the hourly calibrated ASM-SMP estimated the effluent quality and energy consumption of the plant.
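Conceptually, the calibrated ASM-SMP plays the role of the RL environment. The sketch below illustrates one possible wrapper interface under that assumption; run_asm_smp is a hypothetical placeholder for the calibrated simulator, not an actual API from the study.

```python
def run_asm_smp(influent, do_setpoint):
    """Hypothetical placeholder for the hourly calibrated ASM-SMP model.
    It would return the simulated effluent quality and aeration energy for
    one control interval; dummy values are returned here."""
    return {"COD": 0.0, "SS": 0.0, "TN": 0.0}, 0.0


class MBREnvironment:
    """Illustrative environment wrapper: states come from influent data, actions
    are DO setpoints, and rewards are computed from the simulated EQI and AE."""

    def __init__(self, influent_series, reward_fn):
        self.influent = influent_series   # influent records (COD, SS, TN, flowrate, ...)
        self.reward_fn = reward_fn
        self.t = 0

    def reset(self):
        self.t = 0
        return self.influent[self.t]

    def step(self, do_setpoint):
        effluent, aeration_energy = run_asm_smp(self.influent[self.t], do_setpoint)
        reward = self.reward_fn(effluent, aeration_energy)
        self.t += 1
        done = self.t >= len(self.influent) - 1
        return self.influent[self.t], reward, done
```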
Structure determination of RL algorithms for optimization
Figure 2 depicts the structure of the RL algorithm used to obtain an optimal operating trajectory for the aerobic reactor. The structure of the DQN algorithm is emphasized here because it is the most complex of the proposed RL algorithms. The structure consisted of a learning loop and a process operation loop. In the learning loop, the Q-network updated its internal weights (w) sequentially and periodically to maximize the reward values relative to the target Q-value. Historical and Gaussian-noised influent data were fed to the RL algorithm during the learning loops so that globally optimal operational trajectories could be obtained. The number of learning loops corresponds to the number of epochs in the training procedure. Once the learning loops were finished, the Q-network had obtained the globally optimal weights, and in the process operation loop it determined the new optimal operational trajectory according to the new influent conditions of the plant.
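The Gaussian perturbation of the historical influent can be sketched as follows; the relative noise level is an assumed value, since the study does not report the noise magnitude.

```python
import numpy as np

def noisy_influent(historical, rel_std=0.05, rng=None):
    """Multiply historical influent data (array of COD, SS, TN, flowrate, ...) by
    Gaussian factors centred on 1, so each learning loop sees a slightly different
    influent pattern. rel_std = 0.05 is an assumed noise level."""
    rng = np.random.default_rng() if rng is None else rng
    return historical * rng.normal(loc=1.0, scale=rel_std, size=np.shape(historical))
```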
The states were operational process variables such as the COD, SS, and TN concentrations and the flowrate of the influent. The state was used as the input vector to the RL algorithms in the learning and process operation loops, as shown in Figure 2. The operational actions of the MBR plant were selected by the Q-value of the Q-function of the deep neural networks (DNNs), which employed an ε-greedy policy to improve the efficiency of the algorithm. The action variable, a_t, represented the DO setpoint in the aerobic reactor of the MBR. The DO setpoints were chosen from 13 possible actions ranging from 2 mg/L to 7 mg/L at 0.5 mg/L intervals, and the RL algorithms selected the optimal DO concentration from these options at each time step. The optimal DO concentration trajectories were used as input to the calibrated ASM-SMP. The detailed results of the model calibration are discussed in Supplementary Material (B). The reward function of the DQN used the AE and EQI, with the reward assigned according to the system's improvement. Moreover, a negative reward value was given to the agent to penalize inferior economic and environmental performance of the MBR plant. The RL training pseudocode for the aeration system's optimization is summarized in Supplementary Material (C).
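An ε-greedy choice over the discrete DO-setpoint options can be sketched as follows; the exploration rate ε = 0.1 is an assumed value.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Return an action index over the discrete DO-setpoint options:
    with probability epsilon pick a random setpoint (exploration),
    otherwise pick the setpoint with the highest Q-value (exploitation)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```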
To balance the EQI and AE, the weight factor (ω) in the reward function was set to 5 (Figure 2). Moreover, the weight was increased 10-fold whenever the effluent concentrations exceeded the legal discharge limits. The reward values of the RL algorithms increased with better actions, which were learned automatically from the historical action-environment-state data in the direction of improved operating conditions and reduced operational costs. In the deep-SARSA and DQN cases, DNNs were employed to determine the optimal policy for the actions. The DNNs had two hidden layers (64 and 32 neurons) with a rectified linear unit (ReLU) activation function, as given in Supplementary Material (D). Additionally, biases were added to each hidden layer to enhance the performance of the DNNs. The structure of the RL algorithm is summarized in Table 1.
Table 1 | Structure of the RL algorithms

Structure | Description
---|---
Agents | Q-learning, SARSA, deep-SARSA, and deep Q-network
Environment | Calibrated ASM-SMP model
State | 14 variables (time, COD, TN, NO, NH, SS, and influent flowrate at the past and current time steps)
Action | 13 possible DO concentration setpoints for the aerobic reactor
Reward | Weighted combination of the EQI and AE (ω = 5; see Figure 2)
DNNs (deep-SARSA and DQN) | First hidden layer: 64 neurons (ReLU); second hidden layer: 32 neurons (ReLU)
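The exact reward expression is defined in Figure 2 and the Supplementary Material; purely as an illustration of the description above (ω = 5, increased 10-fold when discharge limits are exceeded), an assumed reward of the following form could be used.

```python
def reward(eqi_kg, ae_kwh, effluent, limits, omega=5.0):
    """Assumed reward shape only: a negative weighted sum of EQI and AE, with the
    weight factor raised 10-fold whenever any effluent limit is exceeded, so the
    agent is pushed back toward compliant, low-energy operation."""
    if any(effluent[p] > limits[p] for p in limits):
        omega *= 10.0
    return -(omega * eqi_kg + ae_kwh)

# Hypothetical usage with the discharge limits quoted in the case-study section:
# reward(eqi_kg=1.2, ae_kwh=95.0, effluent={"COD": 16.4, "SS": 0.7, "TN": 6.7},
#        limits={"COD": 52, "SS": 10, "TN": 20})
```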
Full-scale MBR plant
The case study was a full-scale MBR plant located in M-city, Korea. The treatment process consisted of five reactors in series: anaerobic, stabilizing, anoxic, aerobic, and membrane reactors. Regardless of the influent characteristics, the DO concentrations were manually kept at 4 mg/L and 7 mg/L in the aerobic and membrane reactors, respectively, to guarantee the stability of the whole process. The DO concentration in the membrane reactor was set high because coarse bubbles were supplied to prevent fouling, so the aerobic reactor was the target process for the autonomous trajectory-searching system. High electricity consumption was the main drawback of the aerobic reactor, as it consumed one-third of the plant's total energy.
The average influent flowrate was 15,000 m³/day, and the average COD, SS, and TN concentrations were 360, 173, and 45 mg/L, respectively. To meet the effluent quality standards issued by the Korean Ministry of Environment, the COD, SS, and TN concentrations in the effluent discharge had to be kept below 52, 10, and 20 mg/L, respectively. The influent pollutant loads increased from 10 am to 8 pm and decreased during the nighttime owing to household activities (Camacho-Muñoz et al. 2014). The peak pollutant values were 1.3 times the average quantities, and the minimum values were about half of the average. Hence, an optimal aeration system was required to meet the discharge limits under these diurnal and dynamic influent characteristics.
RESULTS AND DISCUSSION
Training procedures of the RL algorithms for autonomous system
Figure 3 shows the reward values of the suggested DQN-based autonomous system versus the cumulative number of epochs. The number of epochs corresponds to the number of learning loops shown in Figure 2; in total, 10,000 learning loops were iterated to train the RL algorithm using historical and Gaussian-noised influent data. The average reward value per 10 epochs is shown to make the increasing tendency of the reward explicit. The RL algorithms increased the rewards automatically, without any know-how or prior knowledge from the operators. Details of the training procedures of the conventional RL-based autonomous systems are given in Supplementary Material (E).
The rewards converged to a higher value (near 50) as the training progressed. Note that the training began with negative reward values, that is, the aeration energy and the pollutant concentrations in the effluent were not yet decreased, because the agent took random actions at the beginning of the training procedure. The DQN accumulated the state, action, and reward elements in its experience replay memory for future exploration until 2,000 epochs had been trained. Once the experience replay memory was filled with random actions and their corresponding elements, the DQN algorithm started to learn how to increase the rewards by searching for optimal DO trajectories in the aerobic reactor. Consequently, the DQN algorithm reached its maximum, converged reward values after 4,000 epochs of training. The Q-network parameters updated after 4,000 epochs were used in the process operation loop (shown in Figure 2), in which the DQN computed the optimal DO setpoints over time from the influent information (state, s_t) obtained from the full-scale MBR plant. Here, the aeration system was optimized to save operating expenditure (OPEX) efficiently. After confirming the adaptive capabilities of the RL-based system for the aeration process, the risk of membrane fouling will be considered using a multi-agent RL algorithm in a future study.
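The training schedule described above can be sketched schematically as follows; env, q_network, memory, select_action, update_network, and random_action are placeholders (see the earlier sketches), and only the epoch counts are taken from the text.

```python
WARMUP_EPOCHS, TOTAL_EPOCHS, BATCH_SIZE = 2_000, 10_000, 32   # epoch counts from the text

def train(env, q_network, memory, select_action, update_network, random_action):
    """Schematic DQN training loop: random actions fill the replay memory during the
    warm-up epochs, after which the Q-network is updated from sampled mini-batches."""
    reward_history = []
    for epoch in range(TOTAL_EPOCHS):
        state, done, episode_reward = env.reset(), False, 0.0
        while not done:
            if epoch < WARMUP_EPOCHS:
                action = random_action()                    # exploration only
            else:
                action = select_action(q_network, state)    # e.g. epsilon-greedy
            next_state, r, done = env.step(action)
            memory.push(state, action, r, next_state)
            if epoch >= WARMUP_EPOCHS and len(memory) >= BATCH_SIZE:
                update_network(q_network, memory.sample(BATCH_SIZE))
            state, episode_reward = next_state, episode_reward + r
        reward_history.append(episode_reward)
    return reward_history
```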
The RL-based autonomous system for a one-day operation
Figure 4 shows the optimal operational trajectory for a one-day operation obtained by the proposed DQN algorithm. Both environmental and economic objectives were targeted in operating the MBR plant. The optimal trajectories and the corresponding AE consumption of the conventional RL algorithms are detailed in Supplementary Material (F).
The DQN-based autonomous system produced a stable optimal DO trajectory over the operational time, and all of its DO setpoints were lower than those of the manual system. Moreover, the optimal operational trajectory showed a clear fluctuating trend that coincided with the influent pollutant pattern: the loading profile over the 24 hours mainly increased from 10 am to 8 pm and decreased during the nighttime, explicitly reflecting the short-term tendency of the influent pollutants (Camacho-Muñoz et al. 2014). Following this loading profile, the DQN-based autonomous system adjusted the operational DO trajectory to reflect the fluctuating tendency. Compared to the conventional RL-based autonomous systems, the DQN-based system proposed smoother variations in the DO concentration setpoints; the variation in aeration energy was therefore relatively low, and the total consumed energy decreased effectively.
The performance of the proposed autonomous operational system in the full-scale MBR plant is summarized in Table 2. The manual system, which kept the DO concentration fixed at 4 mg/L in the aerobic reactor, yielded an EQI of 1,442.99 kg for a one-day operation. The average effluent COD, SS, and TN concentrations were 16.43, 0.71, and 6.68 mg/L, respectively, all below the discharge standard limits. The AE was 4,710 kWh/day, and the plant spent 471 USD of OPEX to supply air to the aerobic reactor; the industrial electricity price for a plant of this size in South Korea was assumed to be 0.1 USD/kWh.
Table 2 | Performance of the aeration systems for a one-day operation

Aeration system | EQI [kg] | EQI improvement [%] | AE [kWh] | AE saving [%]
---|---|---|---|---
Manual | 1,442.99 | – | 4,710 | –
Q-learning | 1,442.03 | −0.01 | 5,813 | −23.41
SARSA | 1,444.17 | −0.15 | 5,536 | −17.53
Deep-SARSA | 1,453.20 | −0.77 | 5,208 | −10.57
DQN | 1,441.51 | 0.03 | 3,118 | 33.18
The reduction of electricity consumption was essential because of the considerable aeration energy consumption of the case-study plant. Minimizing the aeration energy by lowering the operational DO concentration trajectory was even more important than minimizing the EQI. Thus, the proposed optimal aeration system could be helpful for operation with respect to OPEX, because the effluent quality already complied with the discharge standards thanks to the satisfactory removal efficiency of organic matter and nutrient compounds.
The DQN-based autonomous system improved the aeration energy saving to 33.18% by searching for the optimal operational trajectory while maintaining the effluent quality with high treatment efficiency. This was because the DQN algorithm focused on reducing the OPEX rather than improving the effluent quality. This achievement is superior to previous studies on aeration-system optimization, in which a multilevel system improved energy efficiency by 21% and a multi-adaptive regression spline-based system reduced aeration energy by 31% (Ferrero et al. 2011; Asadi et al. 2017). Furthermore, the reduction of aeration energy decreased the OPEX of the aerobic reactor, saving 158 USD in a one-day operation. This means that the developed DQN-based autonomous trajectory-searching system not only decreased the operating costs but also fulfilled the main goal of the process, namely effective wastewater treatment. Hence, the DQN-based autonomous system could be the environmental and informatics core system for the aeration process of the full-scale MBR plant.
Evaluation of the DQN-based autonomous operational trajectory searching system
A biological analysis was conducted to confirm the applicability of the proposed autonomous operational system at the full-scale MBR plant. Figure 5 compares the effluent quality of the optimal DQN-based process with that of the manually operated system, which kept the DO concentration at 4 mg/L without considering the influent dynamics. Although the overall DO setpoints decreased, saving 33.18% of the aeration energy, the effluent COD concentrations were maintained at values similar to those of the manual system. Regarding the nitrogen compounds, the overall NO concentrations were lower than in the manual operation, whereas the Nkj concentrations increased to some extent. This behavior follows from the EQI equation, which combines several pollutant compounds with their corresponding weights. In the manual system, the average weighted effluent NO concentration was 58.26 mg/L and the average weighted effluent Nkj concentration was 25.59 mg/L. Therefore, the DQN focused on decreasing the NO concentration in the effluent rather than the COD and Nkj concentrations, by lowering the DO concentrations and thereby adjusting the nitrification process. The agent of the DQN algorithm was trained to decrease the NO concentration in the effluent while preventing excessive aeration and an unnecessary intensity of nitrification. As a result, the Nkj concentration increased by 8.96%, whereas the NO and TN concentrations decreased by 4.06% and 2.40%, respectively. This indicates that the DQN-based autonomous system maintained a high nutrient removal efficiency while effectively saving operating costs.
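For reference, the EQI used here follows the benchmark definition of Alex et al. (2008), which weights the effluent pollutant concentrations and integrates them over the evaluation period. Assuming the standard benchmark weighting factors, it takes the form

$$\mathrm{EQI} = \frac{1}{1000\,T}\int_{t_0}^{t_0+T}\bigl(2\,\mathrm{SS}_e + \mathrm{COD}_e + 30\,S_{\mathrm{Nkj},e} + 10\,S_{\mathrm{NO},e} + 2\,\mathrm{BOD}_e\bigr)\,Q_e(t)\,dt,$$

so the comparatively large weighted NO term observed in this plant explains why the DQN prioritized lowering the effluent NO concentration.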
The DQN-based autonomous optimal trajectory-searching system was also applied to a long-term operation scenario to analyze its economic and environmental performance. The long-term scenario allowed the adaptation capability of the proposed DQN-based autonomous system to be verified because its influent pattern fluctuated much more strongly. In this scenario, the autonomous system optimized the DO concentration trajectories over a one-month operation period. The results of adapting the autonomous trajectory-searching system to the long-term scenario are depicted in Figure 6. The measured influent TN loading of the case study varied significantly over time: the minimum and maximum TN loadings were 11 kg/hour and 58 kg/hour, respectively, roughly half and double the average TN loading of 27 kg/hour. Abrupt weather changes, such as rainfall, increased the loading on day 23 because combined sewers predominate in the studied area. Moreover, climate change does not guarantee an even trend in the influent data, so the profound influences of temperature and precipitation should be considered directly (Vo et al. 2014). Thus, an effective and robust autonomous system was required to respond not only to the diurnal pattern but also to highly varying influent loadings.
The DQN-based autonomous operational trajectory-searching system proposed a DO trajectory for one month according to the hourly and daily varying influent concentrations. The proposed optimal DO trajectory of the DQN, shown in Figure 6, follows a fluctuating trend similar to the varying influent data. The DO concentration increased up to 3.5 mg/L (days 3-7) and 4 mg/L (days 21-23) when the TN concentration in the influent increased, and the increased DO concentrations coincided well with the increases in TN concentration. In addition, the DO concentrations of the autonomous system were lower than the fixed DO setpoint of the manual system. The proposed DO concentration trajectories, which reflected the varying influent data, were directly related to the nitrification process in the plant, indicating that the DQN-based autonomous system regulated the aeration intensity according to the influent loadings. As a result, the removal efficiency of nitrogen compounds was maintained at 83%, while the EQI decreased slightly from 45,991 kg to 45,868 kg.
Regarding the economic performance of the autonomous system, the DQN proposed lower DO concentration trajectories, and the AE decreased effectively from 135,430 kWh to 85,955 kWh. The DQN-based autonomous system thus improved the environmental efficiency by 0.23% and the economic benefit by 36.53%. The reduction in aeration energy consumption corresponds to a monthly saving of 4,948 USD, or 59,376 USD annually. The proposed DQN-based autonomous operational trajectory-searching system is therefore an environmentally and economically sound solution for maintaining treatment efficiency and strengthening the energy-saving potential under dynamic influent loading conditions.
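As a quick consistency check of these figures, the sketch below reproduces the savings from the AE values quoted above and the 0.1 USD/kWh tariff assumed earlier.

```python
# One-month aeration energy and cost savings, using values quoted in the text.
ae_manual_kwh, ae_dqn_kwh = 135_430, 85_955
price_usd_per_kwh = 0.1                                # assumed industrial tariff (see above)

saving_kwh = ae_manual_kwh - ae_dqn_kwh                # 49,475 kWh per month
saving_pct = 100 * saving_kwh / ae_manual_kwh          # ~36.53 %
monthly_usd = round(saving_kwh * price_usd_per_kwh)    # ~4,948 USD per month
annual_usd = 12 * monthly_usd                          # ~59,376 USD per year

print(f"{saving_pct:.2f}% saved, {monthly_usd:,} USD/month, {annual_usd:,} USD/year")
```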
CONCLUSIONS
A novel RL-based autonomous operational trajectory-searching system was developed to operate a full-scale MBR plant efficiently. The DQN algorithm showed better performance than the other DRL and conventional RL models in terms of environmental and economic improvements. The DQN algorithm searched for robust and dynamic actions to suggest the optimal DO concentration trajectory, using the hourly calibrated ASM-SMP as the algorithm's environment. The rewards of the DQN algorithm were maximized during training, considering the improvements in environmental and economic efficiency. The DQN-based autonomous operating system maintained the wastewater treatment efficiency and reduced the aeration energy by up to 36% in a one-month operation compared to the manual system with a fixed DO setpoint. These results can guide the operators of wastewater treatment plants in saving OPEX efficiently while considering the environmental efficiency of the treatment process. The research can be extended to a multi-agent RL autonomous system to improve the environmental and economic benefits and minimize the risks from membrane fouling simultaneously. The application of a multi-agent RL system could benefit the operation of the MBR process by increasing the lifetime of the membrane and saving additional OPEX. It is anticipated that such future studies will support the realization of a smart water industry from environmental and economic viewpoints.
ACKNOWLEDGEMENTS
This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. NRF-2017R1E1A1A03070713) and by the Korea Ministry of Environment (MOE) through the Graduate School specialized in Climate Change.
SUPPLEMENTARY MATERIAL
The Supplementary Material for this paper is available online at https://dx.doi.org/10.2166/wst.2020.053.
REFERENCES
Author notes
The first and second authors contributed equally to this paper.