Optimal operation of membrane bioreactor (MBR) plants is crucial to saving operational costs while satisfying legal effluent discharge requirements. The aeration process of MBR plants tends to use excessive energy to supply air to micro-organisms. In the present study, a novel optimal aeration system is proposed for dynamic and robust optimization: a deep reinforcement learning (DRL)-based operating system designed to meet stringent discharge requirements while maximizing the system's energy efficiency. It is compared with a manual system and conventional reinforcement learning (RL)-based systems. A deep Q-network (DQN) algorithm automatically learns how to operate the plant efficiently by finding an optimal trajectory that reduces the aeration energy without degrading the treated water quality. A full-scale MBR plant with the DQN-based autonomous aeration system can decrease the MBR's aeration energy consumption by about 34% compared to the other aeration systems while maintaining the treatment efficiency within effluent discharge limits.

A membrane bioreactor (MBR) plant consists of an activated sludge wastewater treatment system retrofitted with a membrane reactor. This configuration has been applied extensively to wastewater treatment because it reduces the plant footprint, produces less sludge, and improves the effluent quality (Van den Broeck et al. 2012). However, dynamic and unstable influent patterns hinder the stability of the effluent quality in MBR plants. Consequently, interest in the efficient operation of MBRs has grown under stringent environmental regulations: water pollution is now considered a serious environmental problem, effluent regulations have become stricter, and advanced operating systems have become increasingly important (Åmand et al. 2013).

Aeration is an important factor in the operation of wastewater treatment plants from both environmental and economic viewpoints. The aeration process is closely linked to micro-organism activity and accounts for about half of the electricity used in wastewater treatment (Hernández-del-Olmo et al. 2012; Åmand et al. 2013). Optimizing the aeration system is therefore vital: an optimized system can improve the effluent quality through micro-organism activity while reducing the aeration energy.

Development of an optimal aeration system is necessary for an efficient, eco-friendly operation of the process. Process optimization reduces operational costs by balancing clean productivity with energy requirements, and it can improve process efficiency without major changes to the physical structures. Thus, process optimization has played a significant role in wastewater treatment research (Mannina & Cosenza 2013).

Traditionally, optimization studies have relied on detailed mechanistic models of wastewater treatment processes, taking environmental and economic aspects into account. Bournazou et al. (2013) suggested a transformed nonlinear programming technique to optimize the aeration profile of a sequencing batch reactor process; the suggested approach was relatively fast in computation and robust in performance. Faridnasr et al. (2016) optimized the aeration time of a moving-bed biofilm sequencing batch reactor using a kinetic computational model. Fan et al. (2017) identified optimal dissolved oxygen (DO) setpoints in an aerobic reactor using growth kinetics parameters under different air flow rates and mixed liquor volatile suspended solids concentrations. Model-based optimization of the aeration system has thus often been developed with empirical mathematical approaches. However, the dynamics of wastewater increase the complexity of the models and the computational run-time of mathematical optimization, so mathematical techniques have not been applied extensively to global optimization problems.

Taking the operational conditions of wastewater treatment plants into account, the influent depends significantly on climate change effects (Vo et al. 2014). Hence, empirical models are not efficient enough to solve such sophisticated problems with many uncertainties. In contrast, an artificial intelligence (AI) algorithm is a robust alternative for optimizing wastewater treatment plants, offering several advantages including flexibility, global optimization, automation, and continuity (Waschneck et al. 2018). Reinforcement learning (RL) is a powerful AI technique for optimizing processes, but its application in wastewater treatment research remains rare.

This study aims to develop a novel optimal autonomous aeration system for a full-scale MBR plant based on the RL algorithm. The autonomous aeration system was used to propose an optimal trajectory for DO concentrations in the aerobic reactor of the full-scale MBR plant. The deep reinforcement learning (DRL) algorithms were compared to conventional RL algorithms, and their performances were evaluated considering economic and environmental improvements in the plant. The DRL algorithms, deep Q network (DQN) and deep-state-action-reward-state-action (deep-SARSA), and the conventional RL algorithms including Q-learning and SARSA were employed to improve the aeration system.

The proposed framework for determining the optimal DO trajectory of the aeration system using the AI algorithm is shown in Figure 1. The approach consists of two parts: (1) selecting the structures of the RL algorithms to be employed in the MBR plant and (2) proposing the optimal DO trajectory in the aeration system considering the influent and operational data of the plant. First, the structure of the RL algorithms was selected based on information obtained from the MBR plant and data provided by the operators. An activated sludge model-soluble microbial product (ASM-SMP) model was calibrated to describe the complex biological and physical interactions between micro-organisms and membranes in the MBR plant (Mannina & Cosenza 2013). The RL structure was then specified in terms of state, action, and reward according to the plant information and operational experience, so as to assess economic benefits and environmental improvements. The aeration energy (AE) and effluent quality index (EQI) were selected as the crucial factors for computing the reward value, reflecting the energy efficiency and the environmental stability of the process; both were estimated with the ASM-SMP following the definitions of Alex et al. (2008). Second, the RL algorithms were applied to determine the optimal DO concentration trajectory in the aeration system using the composed structures. Since the influent of the MBR plant had a periodic diurnal pattern, the DO concentration trajectory was proposed daily: the RL algorithms operated over 47 time intervals, with the optimal DO value determined every 30 min.

The manual system with a fixed DO concentration was compared to four RL algorithm-based systems to determine the robustness of the proposed autonomous aeration system. The aeration systems were based on the Q-learning, SARSA, deep-SARSA, and DQN algorithms, respectively, and were compared in terms of EQI and AE. The superior operational trajectory was the one produced by the system with the highest AE saving potential while maintaining removal efficiency above the target. This trajectory was then applied to two scenarios, (1) a one-day and (2) a one-month operational period, to verify the effectiveness and robustness of the suggested operational trajectory-searching approaches.

Figure 1: Schematic representation of the RL-based autonomous optimal trajectory searching system.

Reinforcement learning algorithms

RL is an AI-based autonomous decision-making algorithm that solves optimization problems through an agent interacting with an environment and receiving a reward value at each simulation time step. The action is determined by a policy that maximizes the cumulative reward obtained by mapping states to actions, and the RL algorithm searches for the optimal policy. A deep learning (DL) algorithm can be used to approximate this policy, an approach that has been highly successful in the fields of image and language processing. In this study, the RL algorithm, the case-study plant, the aeration system, the influent characteristics, and the required energy together with the effluent quality represent the agent, environment, action, state, and reward, respectively (Spielberg et al. 2017). Q-learning, SARSA, deep-SARSA, and DQN were employed to optimize the aeration system of the MBR plant.
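To make this mapping concrete, the following minimal sketch shows the generic agent-environment interaction loop in code. The class and method names (PlantEnvironment, Agent) are illustrative placeholders, not the study's implementation; the toy environment merely stands in for the calibrated plant model, and the reward expression is a dummy signal.

```python
# Illustrative agent-environment loop for the mapping described above.
# `PlantEnvironment` and `Agent` are hypothetical stand-ins, not the study's code.
import random

class PlantEnvironment:
    """Toy stand-in for the calibrated ASM-SMP model (the RL 'environment')."""
    def reset(self):
        return [random.random() for _ in range(14)]        # 14 state variables
    def step(self, action):
        next_state = [random.random() for _ in range(14)]  # influent/plant state
        reward = -abs(action - 6)                          # placeholder reward signal
        return next_state, reward, False

class Agent:
    """Toy agent choosing among 13 discrete DO setpoint indices (0-12)."""
    def act(self, state):
        return random.randrange(13)

env, agent = PlantEnvironment(), Agent()
state = env.reset()
for t in range(47):                                        # 47 half-hourly decisions per day
    action = agent.act(state)
    state, reward, done = env.step(action)
```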

DQN is a DRL algorithm that overcomes the limitations of conventional RL algorithms. It combines Q-learning with a DL technique, referred to as a Q-network, to maximize the rewards obtained from the actions. The approach is trained through the interaction between agent and environment while deriving an optimal policy from the historical dataset, and it automatically obtains an operational policy from high-dimensional historical observations in a flexible manner. Moreover, an experience replay memory stores each transition, that is the current state, action, reward, and next state (s_t, a_t, r_t, and s_t+1). The experience replay memory randomizes over the stored data, removes correlations across sequential samples, and smooths changes in the data distribution; it therefore prevents the correlations that disturb the determination of the optimal action in conventional RL algorithms. The stored dataset is reused across periodic and episodic experiences to train the DQN algorithm efficiently and to avoid overfitting. These structural advantages enhance the DQN algorithm's capability to learn from new data (Mnih et al. 2015; Spielberg et al. 2017). Details of Q-learning, SARSA, and deep-SARSA are described in Supplementary Material (A).
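A minimal sketch of the experience replay idea follows, assuming a simple ring buffer of transitions sampled uniformly at random before each Q-network update. The capacity and batch size are illustrative assumptions, not the study's settings.

```python
# Sketch of experience replay: transitions (state, action, reward, next_state)
# are stored and sampled randomly, breaking temporal correlations before the
# Q-network update. Sizes are illustrative assumptions.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))
    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)
    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory()
# During training one would call memory.push(s_t, a_t, r_t, s_next) each step;
# once the buffer is large enough, minibatches from memory.sample() fit the Q-network.
```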

An environment of RL algorithms using an hourly calibrated ASM-SMP

The ASM-SMP was employed as the environment of the RL-based optimization algorithm. The combined ASM-SMP is a mature tool for modelling the biological and physical phenomena of the MBR process, particularly its operational aspects (Mannina & Cosenza 2013). The ASM component efficiently describes the biological activity of the micro-organisms that remove organic matter and nitrogen compounds (Naessens et al. 2012). For the case-study plant, the ASM-SMP had to be calibrated to adjust its kinetic parameters for an accurate description of micro-organism activity; the kinetic parameters were therefore calibrated using a sensitivity analysis and a genetic algorithm based on hourly measured plant data. The hourly-calibrated ASM-SMP was then used as the environment of the RL algorithms, providing state information and reward values in response to the action values. Its outputs were the estimated effluent quality and energy consumption of the plant.
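The sketch below illustrates the calibration idea only: kinetic parameters are tuned so that model output matches hourly measurements. A toy random search stands in for the genetic algorithm, and the model function, parameter names, and data are placeholders, not the actual ASM-SMP implementation or plant data.

```python
# Hedged sketch of parameter calibration against hourly measurements.
# A toy random search replaces the genetic algorithm; all names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
measured = rng.uniform(10, 20, size=24)            # hourly measured effluent COD (dummy)

def asm_smp_stub(params):
    """Placeholder for the ASM-SMP simulation returning hourly effluent COD."""
    mu_max, k_s = params
    return 15.0 + 2.0 * np.sin(np.arange(24) / 24 * 2 * np.pi) * k_s / mu_max

def fitness(params):
    # Higher is better: negative mean squared error against the measurements
    return -np.mean((asm_smp_stub(params) - measured) ** 2)

# Toy 'evolution': sample candidate parameter sets and keep the best one
candidates = [(rng.uniform(1, 6), rng.uniform(5, 30)) for _ in range(200)]
best = max(candidates, key=fitness)
print("Calibrated (mu_max, K_S):", best)
```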

EQI and AE are widely used indices describing the nutrient removal capability and the intensity of energy consumption, respectively, and were therefore selected to investigate and analyze the process. The EQI and AE are given in Equations (1) and (2) (Alex et al. 2008):
$$\mathrm{EQI} = \frac{1}{T \cdot 1000} \int_{t_0}^{t_f} \left[ B_{SS}\,SS_e(t) + B_{COD}\,COD_e(t) + B_{NKj}\,S_{NKj,e}(t) + B_{NO}\,S_{NO,e}(t) + B_{BOD}\,BOD_e(t) \right] Q_e(t)\, dt \quad (1)$$

$$\mathrm{AE} = \frac{S_O^{sat}}{T \cdot 1800} \int_{t_0}^{t_f} \sum_{i=1}^{n} V_i \cdot K_L a_i(t)\, dt \quad (2)$$

where T is the operational time from $t_0$ to $t_f$; $B_{SS}$, $B_{COD}$, $B_{NKj}$, $B_{NO}$, and $B_{BOD}$ are weighting factors for suspended solids (SS), chemical oxygen demand (COD), Kjeldahl nitrogen (NKj), NO (nitrite and nitrate), and BOD (biological oxygen demand), respectively, with values of 2, 1, 30, 10, and 2; $Q_e$ is the effluent flowrate; $S_O^{sat}$ is the saturation concentration of oxygen (8 mg/L); $V_i$ is the volume of the i-th of n aerobic reactors; and $K_L a_i$ is the oxygen transfer coefficient.
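As a numerical illustration of Equations (1) and (2) on discretized (e.g. half-hourly) data, the following sketch computes both indices with the weighting factors stated above. The array contents are dummy values, not plant data, and the function names are assumptions for illustration only.

```python
# Numerical sketch of Equations (1) and (2) on discretized data (dummy values).
import numpy as np

B = {"SS": 2, "COD": 1, "NKj": 30, "NO": 10, "BOD": 2}      # weighting factors
SO_SAT = 8.0                                                 # mg/L, saturation DO

def eqi(conc, q_e, dt_days, T_days):
    """EQI [kg/d]: flow-weighted, time-averaged effluent pollutant load."""
    load = sum(B[k] * conc[k] for k in B) * q_e              # g/m3 * m3/d = g/d
    return np.sum(load * dt_days) / (T_days * 1000.0)        # -> kg/d

def aeration_energy(kla, volumes, dt_days, T_days):
    """AE [kWh/d] from the KLa trajectories (1/d) of the aerated reactors."""
    power = SO_SAT / 1800.0 * np.sum(volumes[None, :] * kla, axis=1)
    return np.sum(power * dt_days) / T_days

# Example with one day of 30-min samples (dummy numbers):
n = 48; dt = np.full(n, 1.0 / n); T = 1.0
conc = {k: np.full(n, v) for k, v in
        {"SS": 1.0, "COD": 16.0, "NKj": 2.0, "NO": 5.0, "BOD": 2.0}.items()}
print(eqi(conc, q_e=np.full(n, 15000.0), dt_days=dt, T_days=T))
print(aeration_energy(np.full((n, 1), 120.0), np.array([3000.0]), dt, T))
```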

Structure determination of RL algorithms for optimization

Figure 2 depicts the structure of the RL algorithm used to obtain an optimal operating trajectory for the aerobic reactor. The structure of the DQN algorithm is emphasized here because it is the most complex among the proposed RL algorithms. The RL algorithm consisted of a learning loop and a process operation loop. In the learning loop, the Q-network updated its internal weights (w) sequentially and periodically against the target Q-value so as to maximize the reward. Historical and Gaussian-noised influent data were fed to the RL algorithm while the learning loops progressed, in order to obtain globally optimal operational trajectories; the number of learning loops corresponds to the number of epochs in the training procedure. Once the learning loops finished, the Q-network held the globally optimal weights and, in the process operation loop, determined the new optimal operational trajectory according to the new influent conditions of the plant.
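The two loops can be sketched as follows, assuming an agent object that exposes training and action-selection methods. The array shapes, noise level, and function names are illustrative assumptions; only the use of Gaussian-noised historical influent in the learning loop and of new influent in the operation loop follows the text.

```python
# Sketch of the learning loop (historical influent + Gaussian noise) and the
# operation loop (trained policy applied to new influent). Names are assumed.
import numpy as np

rng = np.random.default_rng(0)
historical_influent = rng.uniform(0.5, 1.5, size=(365, 47, 14))   # days x steps x state vars

def noisy_episode():
    """Pick a historical day and add Gaussian noise to vary the conditions."""
    day = historical_influent[rng.integers(len(historical_influent))]
    return day + rng.normal(0.0, 0.05, size=day.shape)

def learning_loop(agent, env, epochs=10000):
    for epoch in range(epochs):                 # epochs = number of learning loops
        episode = noisy_episode()
        agent.train_on_episode(env, episode)    # assumed method: updates Q-network weights w

def operation_loop(agent, new_influent):
    # After training, propose the DO trajectory for the new day's influent
    return [agent.best_action(state) for state in new_influent]   # assumed method
```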

Figure 2: Schematic diagram of the RL-based optimal trajectory searching algorithm in the MBR plant.

The states were operational process variables such as the COD, SS, and TN concentrations and the flowrate of the influent. The state was used as the input vector to the RL algorithms in both the learning and process operation loops, as shown in Figure 2. The operational actions of the MBR plant were selected from the Q-values produced by the deep neural networks (DNNs) under an ε-greedy policy, which improved the efficiency of the algorithm. The action variable, a_t, represented the DO setpoint in the aerobic reactor of the MBR. The DO setpoints were chosen from 13 possible actions ranging from 2 mg/L to 7 mg/L at 0.5 mg/L intervals, and the RL algorithms selected the optimal DO concentration among these options at each time step. The optimal DO concentration trajectories were used as input to the calibrated ASM-SMP; the detailed model calibration results are discussed in Supplementary Material (B). The reward function of the DQN used AE and EQI, assigning the reward according to the system's improvement, and a negative reward value was given to the agent to prevent inferior economic and environmental performance of the MBR plant. The RL training pseudocode for the aeration system optimization is summarized in Supplementary Material (C).
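A minimal sketch of the ε-greedy action selection follows. The Q-values are stubs standing in for the trained DNN output; the 14-dimensional state and the 13 discrete DO options follow the text and Table 1, while the ε value is an illustrative assumption.

```python
# Sketch of epsilon-greedy selection over the discrete DO setpoint options.
import numpy as np

N_ACTIONS = 13                                   # candidate DO setpoints for the aerobic reactor
rng = np.random.default_rng(42)

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore a random DO option with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

state = rng.random(14)                           # 14-dimensional state vector (Table 1)
q_values = rng.random(N_ACTIONS)                 # stub for Q(state, a) from the DNN
print("Chosen action index:", epsilon_greedy(q_values))
```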

To balance the EQI and AE, the weight factor (ω) in the reward function was set to 5 (Figure 2); the weight was increased tenfold whenever the effluent concentrations exceeded the legal discharge limits. The reward values of the RL algorithms increased with better actions, which were learned automatically from the historical action-environment-state data in the direction of improved operating conditions and reduced operational costs. In the deep-SARSA and DQN cases, DNNs were employed to determine the optimal policy for the actions. The DNNs had two hidden layers (64 and 32 neurons) with a rectified linear unit (ReLU) activation function, as given in Supplementary Material (D), and biases were added to each hidden layer to enhance the performance of the DNNs. The structure of the RL algorithms is summarized in Table 1.
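A hedged sketch of a reward of this kind and of the 64/32 ReLU Q-network is given below in PyTorch. The exact reward formula is not stated in the text, so the baseline comparison used here is an assumption; only the weight ω = 5, the tenfold penalty on limit violations, and the network sizes come from the paper.

```python
# Hedged sketch: reward shaping with omega = 5 and a tenfold limit-violation
# penalty, plus the 64/32 ReLU Q-network from Table 1. The reward formula
# (improvement over a reference operation) is an illustrative assumption.
import torch
import torch.nn as nn

OMEGA = 5.0
LIMITS = {"COD": 52.0, "SS": 10.0, "TN": 20.0}     # mg/L, Korean discharge standards

def reward(ae, eqi, effluent, ae_ref, eqi_ref):
    """Illustrative reward: relative improvement over a reference operation."""
    omega = OMEGA
    if any(effluent[k] > LIMITS[k] for k in LIMITS):
        omega *= 10.0                               # tenfold weight when limits are exceeded
    return (ae_ref - ae) / ae_ref - omega * (eqi - eqi_ref) / eqi_ref

# Q-network: 14 state inputs -> 13 action values, two hidden layers with biases
q_network = nn.Sequential(
    nn.Linear(14, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 13),
)
q_values = q_network(torch.rand(1, 14))             # Q(s, a) for all 13 DO options
print(q_values.shape)                               # torch.Size([1, 13])
```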

Table 1: Specifications of the RL algorithms

Structure | Description
Agents | Q-learning, SARSA, deep-SARSA, and deep Q-network (DQN)
Environment | Calibrated ASM-SMP model
State | 14 variables (time; COD, TN, NO, NH, SS, and influent flowrate for the past and current time steps)
Action | 13 candidate DO concentration setpoints for the aerobic reactor
Reward | Weighted combination of AE and EQI (weight factor ω = 5, increased tenfold when effluent discharge limits are exceeded)
DNNs (deep-SARSA and DQN) | First hidden layer: 64 neurons (ReLU); second hidden layer: 32 neurons (ReLU)

Full-scale MBR plant

The case study was a full-scale MBR plant located in M-city, Korea. The treatment process consisted of five reactors in series: anaerobic, stabilizing, anoxic, aerobic, and membrane reactors. Regardless of the influent characteristics, the DO concentrations were manually kept at 4 mg/L in the aerobic reactor and 7 mg/L in the membrane reactor to guarantee the stability of the whole process; the membrane reactor's DO was set high because coarse bubbles were supplied to prevent fouling. The aerobic reactor was therefore the target process for the autonomous trajectory-searching system. Its main drawback was high electricity consumption, as it used one-third of the plant's total energy.

The average influent flowrate was 15,000 m3/day, and the average COD, SS, and TN concentrations were 360, 173, and 45 mg/L, respectively. To meet the wastewater treatment effluent standards issued by the Korean Ministry of Environment, the COD, SS, and TN concentrations in the effluent had to be kept below 52, 10, and 20 mg/L, respectively. The influent pollutant loads increased from 10:00 to 20:00 and decreased during the nighttime owing to household activities (Camacho-Muñoz et al. 2014); the peak pollutant values were 1.3 times the average, and the minimum values were about half of it. Hence, an optimal aeration system was required that could meet the effluent discharge limits under these diurnal and dynamic influent characteristics.
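For illustration, a synthetic diurnal profile matching this description (average COD 360 mg/L, daytime peak about 1.3 times the average, nighttime minimum about half of it) can be generated as follows; this is a visual aid under those stated assumptions, not measured plant data.

```python
# Illustrative synthetic diurnal influent COD profile (not measured data).
import numpy as np

hours = np.arange(0, 24, 0.5)                       # half-hourly grid over one day
avg_cod = 360.0
# Smooth day/night cycle peaking mid-afternoon, scaled between 0.5x and 1.3x the average
cycle = 0.9 + 0.4 * np.sin(2 * np.pi * (hours - 9.0) / 24.0)
cod_influent = avg_cod * np.clip(cycle, 0.5, 1.3)
print(cod_influent.min(), cod_influent.max())       # ~180 and ~468 mg/L
```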

Training procedures of the RL algorithms for the autonomous system

Figure 3 shows the reward values of the suggested DQN-based autonomous system versus the cumulative number of epochs. The number of epochs corresponds to the number of learning loops shown in Figure 2; in total, 10,000 learning loops were iterated to train the RL algorithm using historical and Gaussian-noised influent data. The average reward values per 10 epochs are shown to make the increasing trend of the reward explicit. The RL algorithms increased the rewards automatically, without any know-how or prior knowledge from the operators. Details of the training procedures of the conventional RL-based autonomous systems are given in Supplementary Material (E).

Figure 3: Variation of rewards versus the training epochs of the DQN-based autonomous system.

The rewards converged to a higher value (near 50) as the training progressed. The training began with negative reward values, that is, neither the aeration energy nor the effluent pollutant concentrations were reduced, because the agent acted randomly at the beginning of the training procedure. The DQN accumulated the state, action, and reward elements in its experience replay memory for future exploration until 2,000 epochs had been trained. Once the experience replay memory was filled with random actions and their corresponding elements, the DQN algorithm started to learn how to increase the rewards by searching for optimal DO trajectories in the aerobic reactor. Consequently, the DQN algorithm reached its maximum, converged reward values after 4,000 epochs of training. The Q-network parameters updated after 4,000 epochs were used in the process operation loop (shown in Figure 2), in which the DQN computed optimal DO setpoints over time from the influent information (state s_t) obtained from the full-scale MBR plant. The aeration system was thereby optimized to save operating expenditure (OPEX) efficiently. After confirming the adaptive capabilities of the RL-based system for the aeration process, the risk of membrane fouling will be considered with a multi-agent RL algorithm in a future study.
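The training schedule described here, random actions until the replay memory is filled, then learning from sampled minibatches, can be sketched as below. The epoch counts follow the text; the stub Q-values and the simple exploit rule after warm-up are illustrative assumptions.

```python
# Sketch of the warm-up schedule: random exploration fills the replay memory
# for the first ~2,000 epochs, then the learned Q-values drive action selection.
import random

WARMUP_EPOCHS = 2000
TOTAL_EPOCHS = 10000

def select_action(epoch, q_values):
    if epoch < WARMUP_EPOCHS:
        return random.randrange(len(q_values))      # explore only: fill replay memory
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit learned Q

for epoch in range(TOTAL_EPOCHS):
    q_values = [random.random() for _ in range(13)]  # stub Q-values for 13 DO options
    action = select_action(epoch, q_values)
    # here one would push the transition to replay memory and, after warm-up,
    # also run a Q-network update on a sampled minibatch
```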

The RL-based autonomous system for a one-day operation

Figure 4 shows the optimal operational trajectories for a one-day operation obtained by the proposed DQN algorithm. Both environmental and economic objectives were targeted to operate the MBR plant. The optimal trajectory and the corresponding consumed AE of the conventional RL algorithms are detailed in Supplementary Material (F).

Figure 4: Results of the DQN-based optimal autonomous aeration system applied to the full-scale MBR plant.

The DQN-based autonomous system produced a stable optimal DO trajectory over the operational time, and all its DO setpoints were lower than those of the manual system. Moreover, the optimal operational trajectory showed an explicit fluctuation that coincided with the influent pollutant pattern: the loading profile over 24 hours mainly increased from 10:00 to 20:00 and decreased during the nighttime, explicitly reflecting the short-term influent pollutant tendency (Camacho-Muñoz et al. 2014). Following this loading profile, the DQN-based autonomous system adjusted the operational DO trajectory to the fluctuating load. Compared to the conventional RL-based autonomous systems, the DQN-based system proposed smoother variations in the DO concentration setpoints; the variation in aeration energy was therefore relatively low, and the total consumed energy decreased effectively.

The performance of the proposed autonomous operational system in the full-scale MBR plant is summarized in Table 2. The manual system, which kept the DO concentration fixed at 4 mg/L in the aerobic reactor, yielded an EQI of 1,442.99 kg for a one-day operation. The average COD, SS, and TN concentrations in the effluent were 16.43, 0.71, and 6.68 mg/L, respectively, all below the discharge standard limits. The AE was 4,710 kWh/day, so the plant spent 471 USD of OPEX to supply air to the aerobic reactor; the industrial electricity price for a plant of this size in South Korea was assumed to be 0.1 USD/kWh.

Table 2: A comparison of the manual and RL-based autonomous systems in a one-day operation of the full-scale MBR plant

Aeration system | EQI [kg] | Improvement [%] | AE [kWh] | AE saving [%]
Manual | 1,442.99 | – | 4,710 | –
Q-learning | 1,442.03 | −0.01 | 5,813 | −23.41
SARSA | 1,444.17 | −0.15 | 5,536 | −17.53
Deep-SARSA | 1,453.20 | −0.77 | 5,208 | −10.57
DQN | 1,441.51 | 0.03 | 3,118 | 33.18

Reducing electricity consumption was essential because of the considerable aeration energy consumption of the case-study plant. Minimizing the aeration energy by lowering the operational DO concentration trajectory was even more important than minimizing the EQI. The proposed optimal aeration system could therefore support operation with respect to OPEX, because the effluent quality already complied with the discharge standards thanks to the satisfactory removal of organic matter and nutrient compounds.

The DQN-based autonomous system improved the aeration energy saving by up to 33.18% by searching for the optimal operational trajectory while maintaining the effluent quality with high treatment efficiency. This was because the DQN algorithm focused on reducing the OPEX rather than further improving the effluent quality. This achievement is superior to previous studies on aeration system optimization, in which a multilevel control system improved energy efficiency by 21% and a multivariate adaptive regression spline-based system improved aeration energy by 31% (Ferrero et al. 2011; Asadi et al. 2017). Furthermore, the reduced aeration energy lowered the OPEX of the aerobic reactor, saving 158 USD in a one-day operation. The developed DQN-based autonomous trajectory searching system thus not only decreased the operating costs but also maintained the main goal of the process, treating the wastewater effectively. Hence, the DQN-based autonomous system could serve as the environmental and informatics core system for the aeration process of the full-scale MBR plant.
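As a quick check of the quoted daily saving, using the Table 2 energies and the 0.1 USD/kWh price stated earlier:

```python
# Worked check of the one-day cost saving (the paper reports roughly 158 USD).
ae_manual_kwh = 4710.0
ae_dqn_kwh = 3118.0
price_usd_per_kwh = 0.1
daily_saving_usd = (ae_manual_kwh - ae_dqn_kwh) * price_usd_per_kwh
print(round(daily_saving_usd, 1))   # ~159 USD/day, in line with the ~158 USD reported
```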

Evaluation of the DQN-based autonomous operational trajectory searching system

A biological analysis of the system was conducted to confirm the applicability of the proposed autonomous operational system to the full-scale MBR plant. Figure 5 compares the effluent quality of the optimal DQN-based process with the manually operated system, which kept the DO concentration at 4 mg/L without considering the influent dynamics. Although the DO setpoints were lowered overall, reducing the AE by 33.18%, the effluent COD concentrations were maintained at values similar to those of the manual system. Regarding the nitrogen compounds, the overall NO concentrations were lower than in the manual system, whereas the Nkj concentrations increased to some extent. This follows from the EQI equation, which combines several pollutant compounds with their corresponding weights: in the manual system, the average weighted effluent NO concentration was 58.26 mg/L and the average weighted effluent Nkj concentration was 25.59 mg/L. The DQN therefore focused on decreasing the NO concentration in the effluent rather than COD and Nkj, by lowering the DO concentrations and thereby moderating the nitrification process. The agent of the DQN algorithm was trained to decrease the NO concentration in the effluent while preventing excessive aeration and unnecessarily intense nitrification. As a result, the Nkj concentration increased by 8.96%, whereas the NO and TN concentrations decreased by 4.06% and 2.40%, respectively. This indicated that the DQN-based autonomous system maintained a high nutrient removal efficiency while effectively saving operating costs.

Figure 5: A comparison between the DQN-based autonomous operational trajectory searching and manual systems considering effluent quality.

The DQN-based autonomous optimal trajectory searching system was also employed in a long-term operation scenario to analyze its economic and environmental performance. The long-term scenario could verify the adaptation capability of the proposed DQN-based autonomous system because it involved a much more strongly fluctuating influent pattern. In this scenario, the autonomous system optimized the DO concentration trajectories for a one-month operational period; the results are depicted in Figure 6. The measured influent TN loading of the case study varied significantly over time: the minimum and maximum TN loadings were 11 kg/hour and 58 kg/hour, roughly half and double the average TN loading (27 kg/hour). Abrupt weather changes, such as rainfall, increased the loading on day 23 because the studied area is mostly served by combined sewers. Moreover, climate change does not allow an even influent trend, so the profound influences of temperature and precipitation should be considered directly (Vo et al. 2014). An effective and robust autonomous system was therefore required to respond not only to the diurnal pattern but also to highly varying influent loadings.

Figure 6: Application of the DQN-based autonomous operational trajectory searching system for one month's operation, considering the loading profile of influent TN and the optimal DO setpoints of the aerobic reactor.

The DQN-based autonomous operational trajectory searching system proposed a DO trajectory for one month according to the hourly and daily varying influent concentrations. The proposed optimal DO trajectory of the DQN, shown in Figure 6, had a fluctuating trend similar to that of the varying influent data: the DO concentration increased up to 3.5 mg/L (days 3–7) and 4 mg/L (days 21–23) when the TN concentration in the influent increased, and the increases in DO coincided well with the increases in TN. In addition, the DO concentrations of the autonomous system were lower than the fixed DO setpoint of the manual system. The proposed DO concentration trajectories, which reflected the varying influent data, were directly related to the nitrification process in the plant, indicating that the DQN-based autonomous system regulated the aeration intensity according to the influent loadings. Consequently, the removal efficiency of nitrogen compounds was maintained at 83%, while the EQI decreased slightly from 45,991 kg to 45,868 kg.

Regarding the economic performance of the autonomous system, the DQN proposed lower DO concentration trajectories, and the AE decreased effectively from 135,430 kWh to 85,955 kWh. The DQN-based autonomous system thus improved environmental efficiency by 0.23% and economic benefits by 36.53%. The reduction in aeration energy consumption corresponds to a monthly saving of 4,948 USD, or 59,376 USD annually. The proposed DQN-based autonomous operational trajectory searching system was therefore an environmentally and economically sound solution for maintaining treatment efficiency and strengthening the energy-saving potential under dynamic influent loading conditions.
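The monthly and annual figures can be checked in the same way, again assuming the 0.1 USD/kWh price stated earlier:

```python
# Worked check of the one-month figures quoted above (0.1 USD/kWh assumed).
ae_manual_kwh = 135430.0
ae_dqn_kwh = 85955.0
saving_kwh = ae_manual_kwh - ae_dqn_kwh            # 49,475 kWh/month
monthly_saving_usd = saving_kwh * 0.1              # ~4,948 USD/month
annual_saving_usd = monthly_saving_usd * 12        # ~59,370 USD/year
ae_saving_pct = 100 * saving_kwh / ae_manual_kwh   # ~36.5 %
print(monthly_saving_usd, annual_saving_usd, round(ae_saving_pct, 2))
```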

A novel RL-based autonomous operational trajectory searching system was developed to operate a full-scale MBR plant efficiently. The DQN algorithm showed the best performance among the DRL and conventional RL models in terms of environmental and economic improvements. The DQN algorithm searched for robust and dynamic actions to suggest the optimal DO concentration trajectory, using the hourly calibrated ASM-SMP as the algorithm's environment. The rewards of the DQN algorithm were maximized during training with respect to improvements in environmental and economic efficiency. The DQN-based autonomous operating system maintained the wastewater treatment efficiency while reducing the aeration energy by up to 36% in a one-month operation compared to the manual system with its fixed DO setpoint. These results can guide the operators of wastewater treatment plants in saving OPEX efficiently while considering the environmental efficiency of the treatment process. The research can be extended to a multi-agent RL autonomous system that improves the environmental and economic benefits and minimizes the risks of membrane fouling simultaneously; such a system could further benefit MBR operation by increasing the membrane lifetime and saving additional OPEX. It is anticipated that future studies will support the accomplishment of a smart water industry from environmental and economic viewpoints.

This work was supported by the National Research Foundation (NRF) grant funded by the Korean government (MSIT) (No. NRF-2017R1E1A1A03070713), and Korea Ministry of Environment (MOE) as Graduate School specialized in Climate Change.

The Supplementary Material for this paper is available online at https://dx.doi.org/10.2166/wst.2020.053.

References

Alex, J., Benedetti, L., Copp, J. B., Gernaey, K. V., Jeppsson, U., Nopens, I., Pons, M.-N., Rieger, L. P., Rosén, C., Steyer, J.-P., Vanrolleghem, P. A. & Winkler, S. 2008 Benchmark Simulation Model no. 1 (BSM1). In: Report by the IWA Taskgroup on Benchmarking of Control Strategies for WWTPs, pp. 19–20.

Åmand, L., Olsson, G. & Carlsson, B. 2013 Aeration control – a review. Water Science and Technology 67 (11), 2374–2398.

Asadi, A., Verma, A., Yang, K. & Mejabi, B. 2017 Wastewater treatment aeration process optimization: a data mining approach. Journal of Environmental Management 203, 630–639.

Bournazou, M. C., Hooshiar, K., Arellano-Garcia, H., Wozny, G. & Lyberatos, G. 2013 Model based optimization of the intermittent aeration profile for SBRs under partial nitrification. Water Research 47 (10), 3399–3410.

Camacho-Muñoz, D., Martín, J., Santos, J. L., Aparicio, I. & Alonso, E. 2014 Occurrence of surfactants in wastewater: hourly and seasonal variations in urban and industrial wastewaters from Seville (Southern Spain). Science of the Total Environment 468, 977–984.

Ferrero, G., Monclús, H., Buttiglieri, G., Comas, J. & Rodriguez-Roda, I. 2011 Automatic control system for energy optimization in membrane bioreactors. Desalination 268, 276–280.

Hernández-del-Olmo, F., Llanes, F. H. & Gaudioso, E. 2012 An emergent approach for the control of wastewater treatment plants by means of reinforcement learning techniques. Expert Systems with Applications 39 (3), 2355–2360.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. 2015 Human-level control through deep reinforcement learning. Nature 518 (7540), 529.

Naessens, W., Maere, T., Ratkovich, N., Vedantam, S. & Nopens, I. 2012 Critical review of membrane bioreactor models – part 2: hydrodynamic and integrated models. Bioresource Technology 122, 107–118.

Spielberg, S. P. K., Gopaluni, R. B. & Loewen, P. D. 2017 Deep reinforcement learning approaches for process control. In: 2017 6th International Symposium on Advanced Control of Industrial Processes, pp. 201–206.

Van den Broeck, R., Van Dierdonck, J., Nijskens, P., Dotremont, C., Krzeminski, P., Van der Graaf, J. H. J. M., van Lier, J. B., Van Impe, J. F. M. & Smets, I. Y. 2012 The influence of solids retention time on activated sludge bioflocculation and membrane fouling in a membrane bioreactor (MBR). Journal of Membrane Science 401, 48–55.

Vo, P. T., Ngo, H. H., Guo, W., Zhou, J. L., Nguyen, P. D., Listowski, A. & Wang, X. C. 2014 A mini-review on the impacts of climate change on wastewater reclamation and reuse. Science of the Total Environment 494, 9–17.

Waschneck, B., Reichstaller, A., Belzner, L., Altenmüller, T., Bauernhansl, T., Knapp, A. & Kyek, A. 2018 Optimization of global production scheduling with deep reinforcement learning. Procedia CIRP 72 (1), 1264–1269.

Author notes

The first and second authors contributed equally to this paper.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY-NC-ND 4.0), which permits copying and redistribution for non-commercial purposes with no derivatives, provided the original work is properly cited (http://creativecommons.org/licenses/by-nc-nd/4.0/).
