Abstract

We consider the problem of optimal control of district cooling energy plants (DCEPs) consisting of multiple chillers, a cooling tower, and a thermal energy storage (TES), in the presence of time-varying electricity prices. A straightforward application of model predictive control (MPC) requires solving a challenging mixed-integer nonlinear program (MINLP) because of the on/off actuation of chillers and the complexity of the DCEP model. Reinforcement learning (RL) is an attractive alternative since its real-time control computation is much simpler. But designing an RL controller is challenging due to myriad design choices and computationally intensive training. In this paper, we propose an RL controller and an MPC controller for minimizing the electricity cost of a DCEP, and compare them via simulations. The two controllers are designed to be comparable in terms of objective and information requirements. The RL controller uses a novel Q-learning algorithm that is based on least-squares policy iteration. We describe the design choices for the RL controller, including the choice of state space and basis functions, that are found to be effective. The proposed MPC controller does not need a mixed-integer solver for implementation, but only a nonlinear program (NLP) solver. A rule-based baseline controller is also proposed to aid in comparison. Simulation results show that the proposed RL and MPC controllers achieve similar savings over the baseline controller, about 17%.

1 Introduction

In the US, 75% of the electricity is consumed by buildings, and a large part of that is due to heating, ventilation, and air conditioning (HVAC) systems [1]. In university campuses and large hotels, a large portion of the HVAC's share of electricity is consumed by district cooling energy plants (DCEPs), especially in hot and humid climates. A DCEP produces and supplies chilled water to a group of buildings it serves (hence the moniker "district"), and the air handling units in those buildings use the chilled water to cool and dehumidify air before supplying it to building interiors. Figure 1 shows a schematic of such a plant, which consists of multiple chillers that produce chilled water, a cooling tower that rejects the heat extracted from chillers to the environment, and a thermal energy storage system (TES) for storing chilled water. Chillers—the most electricity-intensive equipment in the DCEP—can produce more chilled water than the buildings need when the electricity price is low. The extra chilled water is stored in the TES, and then used during periods of high electricity price to reduce the total electricity cost. District cooling energy plants are also called central plants or chiller plants.

Fig. 1 Layout of district cooling energy plant

DCEPs are traditionally operated with rule-based control algorithms that use heuristics to reduce electricity cost while meeting the load, such as "chiller priority," "storage priority," and additional control sequencing for the cooling tower operation [2–8]. But making the best use of the chillers and the TES to keep the electricity cost at the minimum requires non-trivial decision-making due to the discrete nature of some control commands, such as chiller on/off actuation, and the highly nonlinear dynamics of the equipment in DCEPs. A growing body of work has proposed algorithms for optimal real-time control of DCEPs. Both model predictive control (MPC) [9–17] and reinforcement learning (RL) [18–26] have been studied.

For MPC, a direct implementation requires solving a high-dimensional mixed-integer nonlinear program (MINLP) that is quite challenging to solve. Various substitute approaches are thus used, which can be categorized into two groups: NLP approximations [9–12] and MILP approximations [13–17]. NLP approximations generally leave the discrete commands to some predetermined control logic and only deal with continuous control commands, which may limit their savings potential. MILP approximations mostly adopt a linear DCEP model so that the problem is tractable, though solving large MILPs is also challenging.

An alternative to MPC is RL—an umbrella term for a set of tools used to approximate an optimal policy using data collected from a physical system, or more frequently, its simulation. Despite a burdensome design and learning phase, real-time control is simpler since control computation is an evaluation of a state-feedback policy. However, designing an RL controller for a DCEP is quite challenging. The performance of an RL controller depends on many design choices and training an RL controller is computationally onerous.

In this paper, we propose an RL controller and an MPC controller for a DCEP, and compare their performances with that of a rule-based baseline (BL) controller through simulations. All three controllers are designed to minimize total energy cost while meeting the required cooling load. The main source of flexibility is the TES, which allows a well-designed controller to charge the TES in periods of low electricity price. The proposed RL controller is based on a new learning algorithm that is inspired by the "convex Q-learning" proposed in recent work [27] and the classical least-squares policy iteration (LSPI) algorithm [28]. Basis functions are carefully designed to reduce the computational burden in training the RL controller. The proposed MPC controller solves a pair of nonlinear programs (NLPs) obtained from the original MINLP via relaxation and heuristics. Hence the MPC controller is a "stand-in" for a true optimal controller and provides a sub-optimal solution to the original MINLP. The baseline controller used for comparison is designed to utilize the TES and time-varying electricity prices (to the extent possible with heuristics) to reduce energy costs. The RL controller and baseline controller have the same information about electricity price: the current price and a backward moving average.

The objective behind this work is to compare the performance of the two complementary approaches, MPC and RL, for the optimal control of all the principal actuators in a DCEP. The two controllers are designed to be comparable, in terms of objective and information requirements. We are not aware of many works that have performed such a comparison; the only exceptions are [25,26], but their decision-making is limited to a TES or temperature setpoints. Since both RL and MPC approaches have merits and weaknesses, designing a controller with one approach and showing it performs well leaves open the question: would the other have performed better? This paper takes a first step in addressing such questions. To aid in this comparison, both controllers are designed to be approximations of the same intractable infinite horizon optimal control problem. Due to the large difference in the respective approaches (MPC and RL), it is not possible to ensure exact parallels for an "apples-to-apples" comparison. But the design problems for the RL and MPC controllers have been formulated to be as similar as possible.

Simulation results show that both controllers, RL and MPC, lead to significant and similar cost savings (16–18%) over a rule-based baseline controller. These values are comparable to those of MPC controllers with mixed-integer formulations reported in the literature, which vary from 10% to 17% [13–17]. The cooling load tracking performance is similar between them. The real-time computation burden of the RL controller is trivial compared to that of the MPC controller, but the RL controller leads to more chiller switches (from off to on and vice versa). However, the MPC controller enjoys the advantage of error-free forecasts in the simulations, something the RL controller does not.

The remainder of the paper is organized as follows. The contribution of the paper over the related literature is discussed in detail in Sec. 1.1. Section 2 describes the district cooling energy plant and its simulation model as well as the control problem. Section 3 describes the proposed RL controller, Sec. 4 presents the proposed MPC controller, and Sec. 5 describes the baseline controller. Section 6 provides a simulation evaluation of the controllers. Section 7 provides an “under-the-hood” view of the design choices for the RL controller. Section 8 concludes the paper.

1.1 Literature Review and Contributions

1.1.1 Prior Work on Reinforcement Learning for DCEP.

There is a large and growing body of work in this area, e.g., Refs. [18–26]. Most of these papers limit the problem to controlling part of a DCEP. For instance, the DCEPs considered in Refs. [18–21,23] do not have a TES. References [18–22] optimize only the chilled water loop but not the cooling water loop (at the cooling tower), while Ref. [24] only optimizes the cooling water loop. The reported energy savings are in the 10–20% range over rule-based baseline controllers, e.g., 15.7% in Ref. [23], 11.5% in Ref. [18], and around 17% in Ref. [21].

Reference [25] considers a complete DCEP, but the control command computed by the RL agent is limited to TES charging and discharging. It is not clear what control law is used to decide chiller commands and cooling water loop setpoints. Reference [26] also considers a complete DCEP, with two chillers, a TES, and a large building with an air handling unit. The RL controller is tasked with commanding only the zone temperature setpoint and TES charging/discharging flowrate whilst the control of the chillers or the cooling tower is not considered. Besides, trajectories of external inputs, e.g., outside air temperature and electricity price, are the same for all training days in Ref. [26]. Another similarity of Refs. [25,26] with this paper is that these references compare the performance of RL with that of a model-based predictive control.

1.1.2 Prior Work on Model Predictive Control for DCEP.

The works that are closest to ours in terms of problem setting are [13–15], which all report MILP relaxation-based MPC schemes to optimally operate a DCEP with TES in the presence of time-varying electricity prices. Reference [13] reports an energy cost savings with MPC of about 10% over a baseline strategy that uses a common heuristic (charge TES all night) with some decisions made by optimization. In Ref. [14], around 15% savings over the currently installed rule-based controller is achieved in a real DCEP. Reference [15] reports a cost savings of 17% over "without load shifting" with the help of the TES in a week-long simulation. Reference [16] also proposes an MILP relaxation-based MPC scheme for controlling a DCEP, and approximately 10% savings in electricity cost over a baseline controller is reported for a one-day long simulation. But the DCEP model in Ref. [16] ignores the effect of weather conditions on plant efficiency, and the baseline controller is not purely rule-based; it makes TES and chiller decisions based on a greedy search. Reference [17] deserves special mention since it reports an experimental demonstration of MPC applied to a large DCEP, the control objective being the manipulation of demand to help with renewable integration and decarbonization. It too uses an MILP relaxation. The decision variables include plant mode (combination of chillers on) and TES operation, but cooling water loop decisions are left to legacy rule-based controllers.

There is another body of work applying MPC to the control of a DCEP, such as Refs. [9–12]. But they either ignore the on/off nature of the chiller control [9,10] or reformulate the problem using some heuristics [11,12] so that the underlying optimization problem is naturally an NLP.

1.2 Contribution Over Prior Art

1.2.1 Contribution Over “Reinforcement Learning for DCEP” Literature.

Unlike most prior works on RL for DCEPs that only deal with a part of a DCEP [18–24], the control commands in this work consist of all the available commands (five in total) of both the chilled and cooling water loops in a full DCEP. To the best of our knowledge, no prior work has used RL to command both the water loops and a TES. Second, unlike some of the closely related work such as Ref. [26], we treat external inputs such as weather and electricity price as RL states, making the proposed RL controller applicable for any time-varying disturbances that can be measured in real-time. Otherwise, the controller is likely to work well only for disturbances seen during training. Third, the proposed RL controller commands the on/off status of chillers directly rather than the chilled/cooling water temperature setpoints [19,21,23] or zone temperature setpoints [26], which eliminates the need for another control system to translate those setpoints to chiller commands. Fourth, all the works cited above rely on discretizing the state and/or action spaces in order to use the classical tabular learning algorithms, with the exception of Ref. [22]. The size of the table becomes prohibitively large if the number of states and control commands is large and a fine-resolution discretization is used. Training such a controller and using it in real-time, which requires searching over this table, then becomes computationally challenging. That is perhaps why only a small number of inputs are chosen as control commands in prior work even though several more setpoints can be manipulated in a real DCEP. Although Ref. [22] considers continuous states, its proposed method only controls part of a DCEP with simplified linear plant models, which may significantly limit its potential for cost savings in reality. In contrast, the RL controller proposed in this paper is for a DCEP model consisting of highly nonlinear equations, and the states and actions are kept continuous except for the one command that is naturally discrete (the number of chillers that are on).

While there is an extensive literature on learning algorithms and on designing RL controllers, the design of an RL controller for practically relevant applications with non-trivial dynamics is quite challenging. RL's performance depends on myriad design choices, not only on the stage cost/reward, function approximation architecture and basis functions, learning algorithm and method of exploration, but also on the choice of the state space itself. A second challenge is that training an RL controller is computationally intensive and brute force training is beyond the computational means of most researchers. For instance, the hardware cost of a single AlphaGo Zero system built by DeepMind in 2017 has been quoted at around $25 million [29]. Careful selection of the design choices mentioned above is thus required, which leads to the third challenge: if a particular set of design choices leads to a policy that does not perform well, there is no principled method to look for improvement. Although RL is being extensively studied in the control community, most works demonstrate their algorithms on plants with simple dynamics and a small number of states and inputs, e.g., Refs. [30,31]. The model for a DCEP used in this paper, arguably still simple compared to available simulation models (e.g., Ref. [32]), is quite complex: it has eight states, five control inputs, three disturbance inputs, and requires solving an optimization problem to compute the next state given the current state, control, and disturbance.

1.2.2 Contribution Over “Model Predictive Control for DCEP” Literature.

The MPC controller proposed here uses a combination of relaxation and heuristics to avoid the MINLP formulation. In contrast to Refs. [13–17], the MPC controller does not use an MILP relaxation. The controller does compute discrete decisions (number of chillers to be on, TES charging/discharging) directly, but it does so by using NLP solvers in conjunction with heuristics. The cost saving obtained is similar to those reported in earlier works that use an MILP relaxation. Compared with other NLP formulations [9–12], our MPC controller determines the on/off actuation of equipment and the TES charging/discharging operation directly.

Closed-loop simulations are provided for all three controllers, RL, MPC, and baseline, to assess the trade-offs among these controllers, especially between the model-based MPC controller and the “model-free” RL controller.

1.2.3 Contribution Over a Preliminary Version.

The RL controller described here was presented in a preliminary version of this paper [33]. There are three improvements. First, an MPC controller, which is not presented in Ref. [33], was designed, evaluated, and compared with our RL controller. Therefore, the optimality of our control with RL is better assessed. Another difference is that the baseline controller described here is improved over that in Ref. [33] so that the frequency of on/off switching of chillers is reduced. Lastly, a much more thorough discussion of the RL controller design choices and their observed impacts are included here than in Ref. [33]. Given the main challenge with designing RL controllers for complex physical systems discussed above, namely, “what knob to tweak when it doesn’t work?,” we believe this information will be valuable to other researchers.

2 System Description and Control Problem

The DCEP contains a TES, multiple chillers and chilled water pumps, a cooling tower and cooling water pumps, and finally a collection of buildings that use the chilled water to provide air conditioning; see Fig. 2. The heat load from the buildings is absorbed by the cold chilled water supplied by the DCEP, and thus the return chilled water is warmer. This part of the water is called load water, and the related variables are denoted by superscript lw for "load water." The chilled water loop (subscript chw) removes this heat and transmits it to the cooling water loop (subscript cw). The cooling water loop absorbs this heat and sends it to the cooling tower, where the heat is then rejected to the ambient. The cooling tower cools down the cooling water returned from the chiller by passing both the sprayed cooling water and ambient air through a fill. During this process, a small amount of the water spray evaporates into the air, removing heat from the cooling water. The cooling water lost to evaporation is replenished by fresh water, so we assume the supply water flowrate equals the return water flowrate at the cooling tower. A fan or a set of fans is used to maintain the ambient airflow at the cooling tower. Connected to the chilled water loop is a TES tank that stores water (subscript tw). The total volume of the water in the TES tank is constant, but a thermocline separates two volumes: cold water that is supplied by the chiller (subscript twc for "tank water, cold") and warm water returned from the load (subscript tww for "tank water, warm").

Fig. 2 Detailed description of district cooling energy plant

2.1 DCEP Dynamics.

Time is discretized with a sampling period ts, with a counter k = 0, 1, … denoting the time-step. Considering hardware limits and ease of implementation, the control commands are chosen as follows:

  1. m˙klw, the chilled water flowrate going through the cooling coil, to ensure the required cooling load is met.

  2. m˙tw, charging/discharging flowrate of the TES, to take advantage of load shifting.

  3. nch, the number of active chillers to ensure the amount of chilled water required is met and the coldness of the chilled water is maintained.

  4. m˙cw, the flowrate of cooling water going through the condenser of chillers to absorb the heat from the chilled water loop.

  5. m˙oa, the flowrate of ambient air that cools down the cooling water to maintain its temperature within the desired range.

Therefore, the control command uk is
(1)
Each of these variables can be independently chosen as setpoints since lower level PI-control loops maintain them. There are limits to these setpoints, which determine the admissible input set U ⊆ {0, …, nmaxch} × R4
(2)

Since the TES can be charged and discharged, we declare m˙tw>0 for charging and m˙tw<0 for discharging as a convention.

The state of the DCEP xp is
(3)
where Stwc and Stww are the fractions of the cold water and warm water in the TES tank, with Stwc + Stww = 1. The other state variables are temperatures at various locations—supply (subscript ", s") and return (subscript ", r")—in each water loop: load water, cooling water, tank water, and chiller; see Fig. 2 for details. All the plant state variables xp can be measured with sensors. The superscript "p" of x emphasizes that xp is the state of the "plant," not the state in the reinforcement learning method that will be introduced in Sec. 3.3.

The plant state xp is affected by exogenous disturbances wkp := [Tkoawb, q˙kL,ref]T ∈ R2, where q˙kL,ref is the required cooling load, the rate at which heat needs to be removed from the buildings, and Tkoawb is the ambient wet-bulb temperature. The disturbance wkp cannot be ignored; e.g., the ambient wet-bulb temperature plays a critical role in the cooling tower dynamics.

The control command and disturbances affect the state through a highly nonlinear dynamic model
(4)
that is described in detail in Ref. [34]. The dynamics (4) are implicit: there is no explicit function f(·) that can be evaluated to obtain xk+1. The reason is that all the heat exchangers (in the building cooling coils, in each chiller, and in the cooling tower) have limited capacities. Depending on the cooling load, the number of active chillers, and the outdoor air wet-bulb temperature, one of the heat exchangers might saturate. That is, it will deliver only as much heat exchange as its capacity allows, which is less than what the load demands. Which heat exchanger saturates first depends on the current state, disturbance, and control in a complex manner. Hence, some form of iterative computation is required to simulate the dynamics, e.g., the method developed in Ref. [35]. A generalized way to perform the iterative update to account for the limits of heat exchange capacities is by solving a constrained optimization problem, which is the method used in this work. The method is described in detail in Ref. [34].
Here we provide an outline for use in the sequel. First, define the decision variable zk as
(5)
where xp is defined in Eq. (3), q˙L is the cooling load met by the DCEP, q˙ch and q˙ct are the cooling provided by chillers and cooling towers. Then the value of zk is computed by solving the following optimization problem
(6)
where Tsetchw,s and Tsetcw,s are pre-specified setpoints that reflect DCEP nominal working conditions, and r1, r2, and r3 are positive weights, with r1 ≫ r2, r3 to promote load tracking. The set Ω(xkp,wkp,uk) is defined by the dynamics and constraints of the DCEP system, including the dynamics of the various heat exchangers and the TES, and the capacity limits of the heat exchangers in the buildings' air handling units, chillers, and the cooling tower. Please refer to Ref. [34] for the derivation of Ω(xkp,wkp,uk). When the required cooling q˙kL,ref is within the capacity of all the heat exchangers, the solution to Eq. (6) yields q˙kL = q˙kL,ref. When the required load exceeds the capacity of the DCEP, Eq. (6) leads to a solution that trades off maintaining nominal setpoints and meeting the cooling load, while respecting the limits of the heat exchangers. The solution yields the next state xk+1p (the xp-block of zk*), and thus Eq. (6) implicitly defines the model f(·). In this paper, we use casadi/ipopt [36,37] to solve Eq. (6) for simulating the plant.
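To make the simulation step concrete, the following Python sketch (using casadi/ipopt, as in the paper) shows how one step of the implicit dynamics can be computed by solving an NLP of the form (6). The constraint builder build_omega(), the state indices used for the supply temperatures, and the weight values are illustrative placeholders standing in for the full model of Ref. [34], not the actual implementation.

import casadi as ca

def simulate_step(x_p, w_p, u, build_omega, T_chw_set=7.0, T_cw_set=29.0,
                  r1=1e3, r2=1.0, r3=1.0):
    # Decision variable z = [x^p (8 states), qdot_L, qdot_ch, qdot_ct] as in Eq. (5).
    z = ca.MX.sym("z", 11)
    q_L = z[8]
    q_L_ref = w_p[1]                          # required cooling load (second disturbance)
    # Objective of (6): track the load and keep supply temperatures near their setpoints.
    # Indices 1 and 4 for T^chw,s and T^cw,s are assumed here for illustration only.
    obj = (r1 * (q_L - q_L_ref) ** 2
           + r2 * (z[1] - T_chw_set) ** 2
           + r3 * (z[4] - T_cw_set) ** 2)
    # build_omega() must return the constraint expressions and bounds defining Omega(x, w, u).
    g, lbg, ubg = build_omega(z, x_p, w_p, u)
    solver = ca.nlpsol("step_nlp", "ipopt", {"x": z, "f": obj, "g": g})
    sol = solver(x0=ca.vertcat(x_p, q_L_ref, q_L_ref, q_L_ref), lbg=lbg, ubg=ubg)
    z_opt = sol["x"].full().ravel()
    return z_opt[:8], z_opt[8]                # next plant state x_{k+1}^p and the met load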

2.2 Electrical Demand and Electricity Cost.

In the DCEP considered, the only energy used is electricity. The relationship between the thermal quantities and the electricity consumption of the chillers and the cooling tower is complex. We model the chillers' power consumption Pch as [38]
(7)
Power consumption of water pumps is modeled using the black-box model in Ref. [13]
(8)
(9)
Finally, the electrical power consumption of the cooling tower mainly comes from its fan and is modeled as [39]
(10)
The constants αi, βi, γi, and λ are empirical parameters. The total electric power consumption of the DCEP is
(11)

2.3 Model Calibration and Validation.

The parameters of the simulation model in Sec. 2.1 and the electrical demand model in Sec. 2.2 are calibrated using data from the energy management system in United World College (UWC) of South East Asia Tampines Campus in Singapore, shown in Fig. 3(b). The data are publicly available in Ref. [40], and details of the data are discussed in Ref. [41]. There are three chillers and nine cooling towers in the DCEP. The data from chiller one and cooling tower one are used for model calibration. We use 80% of the data for model identification and 20% for verification. The out-of-sample prediction results for the total electrical demand are shown in Fig. 3. Comparisons between data and prediction for other variables are not shown in the interest of space.
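As an illustration of the calibration workflow, the sketch below fits the empirical parameters of one of the power models by ordinary least squares on an 80/20 identification/verification split, assuming the model is linear in its parameters once the regressors are chosen. The function build_regressors() and the column name P_ch are placeholders, not part of the paper.

import numpy as np

def calibrate(data, build_regressors, target_col="P_ch"):
    # data: a pandas DataFrame of plant measurements (assumed); build_regressors()
    # returns the regressor matrix implied by the chosen model form, e.g., Eq. (7).
    n = len(data)
    n_train = int(0.8 * n)                      # 80% identification, 20% verification
    X = build_regressors(data)                  # shape (n, number of empirical parameters)
    y = data[target_col].to_numpy()
    theta, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    y_hat = X[n_train:] @ theta                 # out-of-sample prediction
    rmse = float(np.sqrt(np.mean((y[n_train:] - y_hat) ** 2)))
    return theta, rmse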

Fig. 3 (Top) Map of the campus with a DCEP whose data are used for model calibration, and (bottom) out of sample prediction for Pch using the calibrated model (7); Toadb is the ambient dry-bulb temperature: (a) schematic of the UWC Tampines Campus, from Ref. [40] and (b) chiller power measurement Pch and prediction P^ch

2.4 The (Ideal) Control Problem.

The electricity cost incurred during the kth time-step is
(12)
where Pktot is the total electric power consumed in k and is defined in Eq. (11). The goal of operating the DCEP to minimize electricity cost while meeting the required cooling load q˙kL,ref can be posed as the following infinite horizon optimal control problem:
(13)
(14)
(15)
where ρk (USD/kWh) is the electricity price. The state xkp, input uk, and disturbance wkp of the DCEP are defined in Sec. 2.1; q˙kL (“L” stands for “load”) represents the actual cooling load met by the DCEP, which is a function of xkp and uk. The bounds for xkp and uk are Xp(wk) and U(xkp,wk). The reason these sets are dependent on the state or disturbance can be found in the description of the dynamic model of the plant in Ref. [34].

Even when truncated to a finite planning horizon, Eq. (13) is an MINLP because nkch is integer-valued and the dynamics (4) are nonlinear. In the sequel, we propose two controllers to solve approximations of this idealized problem.

3 Reinforcement Learning Basics and Proposed Reinforcement Learning Controller

3.1 Reinforcement Learning Basics.

For the following construction, let x represent the state with state space X and u the input with input space U(x). Now consider the following infinite horizon discounted optimal control problem:

(16)
where U := {u0, u1, …}, c: X × U → R≥0 is the stage cost, γ ∈ (0, 1) is the discount factor, F(·, ·) defines the dynamics, and J*: X → R+ is the optimal value function. The goal of the RL framework is to learn an approximate optimal policy ϕ: X → U for the problem (16) without requiring explicit knowledge of the model F(·, ·). The learning process is based on the Q function. Given a policy ϕ for Eq. (16), the Q function associated with this policy is defined as
(17)
where for k ≥ 0 we have xk+1 = F(xk, uk) and for k ≥ 1 we have uk = ϕ(xk). A well-known fact is that the optimal policy satisfies [42]
(18)
where Q*:=Qϕ* is the Q function for the optimal policy. Further, for any policy ϕ the Q function satisfies the following fixed point relation:
(19)
for all u ∈ U(x), x ∈ X, and x+ = F(x, u). The above relation is termed here the fixed-policy Bellman equation. If the optimal Q-function can be learned, the optimal control command uk* is computed from the Q-function as
(20)

3.2 Proposed Reinforcement Learning Algorithm.

The proposed learning algorithm has two parts: policy evaluation and policy improvement. First, in policy evaluation, a parametric approximation to the fixed policy Q-function is learned by constructing a residual term from Eq. (19) as an error to minimize. Second, in policy improvement, the learned approximation is used to define a new policy based on Eq. (18). For policy evaluation, suppose for a policy ϕ the Q function is approximated as
(21)
where Qϕθ(·,·) is the function approximator (e.g., a neural network) and θ ∈ Rd is the parameter vector (e.g., weights of the network). To fit the approximator, suppose that the system is simulated for Tsim time-steps so that Tsim tuples of (xk, uk, xk+1) are collected to produce Tsim values of
(22)
which is the temporal difference error for the approximator. We then obtain θ* by solving the following optimization problem:
(23)
where D(θ) := [d0(θ), …, dTsim−1(θ)]. The term ‖θ − θ̄‖2 is a regularizer and α is a gain. The values of θ̄ and α are specified in step (3) of Algorithm 1. The non-negativity constraint on the approximate Q-function is imposed since the Q-function is a discounted sum of non-negative terms (17). How it is enforced is described in Sec. 3.3.3. The solution to Eq. (23) results in Qϕθ*, which is an approximation to Qϕ. The quantity Qϕθ* can be used to obtain an improved policy, denoted ϕ+, through
(24)
This process of policy evaluation (23) and policy improvement (24) are repeated. This iterative procedure is described formally in Algorithm 1, with Npol denoting the number of policy improvements.

This algorithm is inspired by: (i) the Batch Convex Q-learning algorithm found in Ref. [27, Sec. III] and (ii) the LSPI algorithm [28]. The approach here is simpler than the batch optimization problem that underlies the algorithm in Ref. [27, Sec. III], whose objective function itself contains an optimization problem. In comparison to Ref. [28], we include a regularization term, inspired by proximal methods in optimization, that aids convergence, and a constraint to ensure the learned Q-function is non-negative.

Algorithm 1

 Result: An approximate optimal policy ϕNpol(x).

Input: Tsim, θ0, Npol, β > 1

For j = 0, …, Npol − 1 do

(1) Follow an exploration strategy and obtain the input sequence {ukj}, k = 0, …, Tsim − 1, the initial state x0j, and the state sequence {xkj}, k = 1, …, Tsim.

(2) For k = 1, …, Tsim, obtain: ϕj(xkj) = argmin over u ∈ U(xkj) of Qϕθj(xkj, u).

(3) Set θ̄ = θj and α = jβ appearing in Eq. (23).

(4) Use the samples {ukj}, k = 0, …, Tsim − 1, {xkj}, k = 0, …, Tsim, and {ϕj(xkj)}, k = 1, …, Tsim, to construct and solve Eq. (23) for θ*.

(5) Set θj+1 = θ*.

end
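The structure of Algorithm 1 can be summarized by the following Python sketch. The callbacks explore(), greedy_policy(), and fit_q() stand in for steps (1), (2), and (4); step (3) is read here as α = j·β. This is an outline only, not the paper's implementation.

def train_policy(theta0, N_pol, T_sim, beta, explore, greedy_policy, fit_q):
    theta = theta0
    for j in range(N_pol):
        # Step (1): roll out an exploration strategy for T_sim steps.
        xs, us = explore(theta, j, T_sim)          # states x_0..x_Tsim, inputs u_0..u_{Tsim-1}
        # Step (2): greedy inputs under the current Q-function approximation.
        phi = [greedy_policy(theta, x) for x in xs[1:]]
        # Steps (3)-(5): proximal-regularized least-squares fit of Eq. (23), then update.
        theta = fit_q(xs, us, phi, theta_bar=theta, alpha=j * beta)
    return theta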

3.3 Proposed Reinforcement Learning Controller for DCEP.

We now specify the ingredients required to apply Algorithm 1 to obtain an RL controller (i.e., a state-feedback policy) for the DCEP from simulation data. Namely, (1) the state description, (2) the cost function design, (3) the approximation architecture, and (4) the exploration strategy. Parts (1), (2), and (3) refer to the setup of the optimal control problem that the RL algorithm is attempting to approximately solve. Part (4) refers to the selection of how the state/input space is explored (step (1) in Algorithm 1).

3.3.1 State Space Description.

In RL, the construction of the state space is an important feature, and the state is not necessarily the same as the plant state. To define the state space for RL, we first denote wk as the vector of exogenous variables
(25)
where ρ̄k = (1/τ) Σt=k−τ,…,k ρt is a backward moving average of the electricity price.
(26)
Note that with the state defined by Eq. (26), a state-feedback policy is implementable since all entries of xk can be measured with commercially available sensors (e.g., outside wet-bulb temperature, Toawb), or estimated from measurements (e.g., the thermal load from buildings, q˙L,ref), or known via real-time communication (e.g., the electricity prices, ρk and ρ¯k).
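A small Python sketch of how the expanded RL state (26) can be assembled from measurements is given below. With a 10-minute time-step, the 4-hour averaging window of Sec. 3.3.5 corresponds to τ = 24 samples; the variable names are illustrative.

import numpy as np

def rl_state(x_p, T_oa_wb, q_L_ref, price_history, tau=24):
    rho_k = price_history[-1]                               # current electricity price
    rho_bar = float(np.mean(price_history[-(tau + 1):]))    # backward moving average of the price
    w_k = np.array([T_oa_wb, q_L_ref, rho_k, rho_bar])      # exogenous variables as in (25)
    return np.concatenate([np.asarray(x_p, dtype=float), w_k])  # expanded RL state x_k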

3.3.2 Design of Stage Cost.

The design of the stage cost is also an important aspect of RL. We wish to obtain a policy that tracks the load q˙kL,ref whilst spending a minimal amount of money, as described in Sec. 2.4. Therefore we choose
(27)
where κ is a design parameter; κ ≫ 1 will prefer load tracking over energy cost.
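A hedged sketch of a stage cost consistent with this description is shown below: the electricity cost of the time-step plus a load-tracking penalty weighted by κ. The quadratic form of the tracking term is an assumption made here for illustration; the paper specifies only the two ingredients and the role of κ.

def stage_cost(rho_k, P_tot_kW, q_L, q_L_ref, kappa=500.0, ts_hours=10.0 / 60.0):
    energy_cost = rho_k * P_tot_kW * ts_hours      # USD, with the price in USD/kWh
    tracking = kappa * (q_L - q_L_ref) ** 2        # assumed quadratic tracking penalty
    return energy_cost + tracking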

3.3.3 Approximation Architecture.

We choose the following linear-in-the-parameter approximation of the Q function:
(28)
where ψ(x, u) are nonlinear basis functions and θ ∈ Rd is the parameter vector. We select a quadratic basis, so that each element of ψ(x, u) is a product of two entries of (x, u), i.e., of the form xixj, xiuj, or uiuj. Specifically, a subset of all possible quadratic terms is chosen as the basis. More on this subset is provided in Sec. 7. We can equivalently express the approximation (28) as
(29)
for appropriately chosen Pθ. In this form, it is straightforward to enforce the constraint in Eq. (23) by enforcing the convex constraint Pθ ⪰ 0.
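The policy-evaluation fit (23) under this architecture is a small semidefinite program. The Python/cvxpy sketch below shows one way to set it up: the Q-function is z'Pθz with z = [x; u], the residuals follow the fixed-policy Bellman equation (19), the proximal term is weighted by α, and Q ≥ 0 is enforced through Pθ ⪰ 0 restricted to the sparsity pattern of Fig. 10. The residual form and the variable names are our reading of Eqs. (22) and (23), not the authors' code.

import cvxpy as cp
import numpy as np

def policy_evaluation(Z, Z_next_phi, costs, P_bar, mask, gamma=0.97, alpha=1.0):
    # Z[k] = [x_k; u_k], Z_next_phi[k] = [x_{k+1}; phi(x_{k+1})], costs[k] = c(x_k, u_k).
    m = Z.shape[1]
    P = cp.Variable((m, m), symmetric=True)        # P_theta of Eq. (29)
    resid = [cp.quad_form(z, P) - c - gamma * cp.quad_form(zp, P)
             for z, zp, c in zip(Z, Z_next_phi, costs)]   # TD residuals d_k(theta)
    obj = cp.sum_squares(cp.hstack(resid)) + alpha * cp.sum_squares(P - P_bar)
    constraints = [P >> 0,                          # Q >= 0 via P_theta positive semidefinite
                   cp.multiply(1.0 - mask, P) == 0]  # zero entries outside the pattern of Fig. 10
    cp.Problem(cp.Minimize(obj), constraints).solve()
    return P.value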

3.3.4 Exploration Strategy.

Exploration refers to how the state/input sequences appearing in step (1) of Algorithm 1 are simulated. We utilize a modified ε-greedy exploration scheme. At time-step k of iteration j, we obtain the input ukj from one of three methods: (i) by using the policy in step (2) of Algorithm 1, (ii) by selecting uniformly random feasible inputs, or (iii) by using the rule-based baseline controller (described in Sec. 5). The states are obtained sequentially through simulation, starting from state x0j for each j. The choice among the three controllers is determined by the probability mass function νexpj ∈ R3, which depends on the iteration index of the policy iteration loop
(30)

The entries correspond to the probability of using the corresponding control strategy, in the (i)–(iii) order just introduced. The rationale for this choice is that the BL controller provides "reasonable" state-input examples for the RL algorithm in the early learning iterations to steer the parameter values in the correct direction. After this early learning phase, weight is shifted toward the current working policy to force the learning algorithm to update the parameter vector in response to its own actions.
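The following sketch illustrates the exploration choice at one time-step. The exact schedule νexpj in (30) is not reproduced; the probabilities below are placeholders that only follow the stated rationale of shifting weight from the baseline controller toward the current policy as the iteration index j grows.

import numpy as np

def choose_controller(j, N_pol, rng=None):
    rng = rng or np.random.default_rng()
    p_baseline = max(0.0, 0.5 * (1.0 - j / N_pol))   # illustrative schedule, not Eq. (30)
    p_random = 0.2                                   # illustrative constant exploration weight
    p_policy = 1.0 - p_baseline - p_random
    return rng.choice(["policy", "random", "baseline"],
                      p=[p_policy, p_random, p_baseline])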

3.3.5 Training Settings.

The policy evaluation problem (23) during training is solved using cvx [43]. The simulation model (4) to generate state updates, which requires solving a non-convex NLP, is solved using casadi and ipopt [36,37].

The parameters used for RL training are γ = 0.97, d = 36, κ = 500, β = 100, Tsim = 432, and Npol = 50. The parameter τ for the backward moving average filter on the electricity price is chosen to represent 4 h. The choice of the 36 basis functions is a bit involved; it is discussed in Sec. 7. Because a simulation time-step, k to k + 1, corresponds to a time interval of 10 min, Tsim = 432 corresponds to 3 days. The controller was trained with weather and load data for 3 days (Oct. 10–12, 2011) from the Singapore UWC campus dataset described in Sec. 2.3. The electricity price data used for training were taken as a scaled version of the locational marginal price from PJM [44] for 3 days (Aug. 30–Sept. 1, 2021).

3.4 Real-Time Implementation.

Once the RL controller is trained, it computes the control command uk in real-time as
(31)
where θ̂ is the parameter vector learned in Algorithm 1. This θ̂ need not be θNpol; rather, it is the iterate with the best closed-loop performance, as explained in Sec. 6.2.

Due to the non-convexity of the set U(xk) and the integer nature of nkch, Eq. (31) is a non-convex, mixed-integer problem. We solve it using exhaustive search: for each possible value of nkch, we solve the corresponding continuous-variable nonlinear program using casadi/ipopt [36,37], and then choose the best of the (nmaxch + 1) solutions by direct search. Direct search is feasible because nmaxch is a small number in practice (nmaxch = 7 in our simulated example).
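The real-time evaluation of (31) can be organized as in the sketch below: enumerate the chiller count, solve the remaining continuous minimization of the learned Q-function for each value, and keep the best candidate. The inner solver solve_continuous() (casadi/ipopt in the paper) is a placeholder.

def rl_control(x_k, solve_continuous, n_ch_max=7):
    best_u, best_q = None, float("inf")
    for n_ch in range(n_ch_max + 1):                 # n_ch = 0, 1, ..., n_ch_max
        u_cont, q_val = solve_continuous(x_k, n_ch)  # NLP over the continuous commands only
        if q_val < best_q:
            best_u, best_q = (n_ch, u_cont), q_val
    return best_u                                    # chiller count plus continuous setpoints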

4 Proposed Model Predictive Controller

Recall that a straightforward translation of Eq. (13) to MPC will require solving the following problem at every time index k (here we only describe the one at k = 0 to avoid cumbersome notation)
(32)
where ckE is defined in Eq. (12), and Tplan is the planning horizon. Even for a moderate planning horizon Tplan, the optimization problem (32) is a large MINLP. We now describe an algorithm that uses a dynamic model of the DCEP to approximately solve Eq. (32) without needing to solve an MINLP or even an MILP. This algorithm, which we call model-based sub-optimal control (MBOC), is then used to implement MPC by applying it repeatedly in a receding horizon setting as new forecasts of external disturbances become available.

The first challenge we have to overcome is not related to the mixed-integer nature of the problem but is related to the complex nature of the dynamics. Recall from Sec. 2.1 that the dynamic model, i.e., the function f in the equality constraint xk+1 = f(·) in Eq. (4) is not available in explicit form; rather the state is propagated in the simulation by solving an optimization problem. Without an explicit form for the function f(·), modern software tools that reduce the drudgery in nonlinear programming, namely numerical solvers with automatic differentiation, cannot be used.

We address this challenge by substituting the implicit equality constraint xk+1p = f(xkp,uk,wkp) in Eq. (32) with the underlying constraints Ωk(·) in Eq. (6), and adding the objective of Eq. (6) to the objective of Eq. (32). The modified problem becomes
(33)
Since the input nkch takes integer values in the set {0, 1, …, nmaxch}, Eq. (33) is still a high-dimensional MINLP.

The proposed algorithm to approximately solve Eq. (33) without using an MINLP solver or an MILP relaxation consists of three steps. These are listed below in brief, with more details provided subsequently.

  1. The integer variable nch ∈ {0, 1, …, nmaxch} is relaxed to a continuous one nch,c ∈ [0, nmaxch]. The relaxed problem, an NLP, is solved using an NLP solver to obtain a locally optimal solution. In this paper, we use ipopt (through casadi) to solve this relaxed NLP.

  2. The continuous solution {nkch,c}, k = 0, …, Tplan − 1, resulting from step 1 is processed using Algorithms 2 and 3 to produce a transformed solution that is integer-valued, denoted by {nkch,d}, k = 0, …, Tplan − 1.

  3. In Eq. (33), the chiller command {nkch} is fixed at the integer values {nkch,d} obtained in step 2, and the resulting NLP is solved again. The resulting solution is called the MBOC.

In the sequel, we will refer to a vector with non-negative integer components, x ∈ Zn, as an n-length discrete signal. For a discrete signal x ∈ Zn, the number of switches, Nswitch, is defined as the number of times two consecutive entries differ: Nswitch := Σi=1,…,n−1 I(xi − xi+1), where I(·) is the indicator function: I(0) = 0 and I(y) = 1 for y ≠ 0.

The continuous relaxation in step 1 is inspired by branch and bound algorithms for solving MINLPs, since such a relaxation is the first step in branch and bound algorithms. However, a simple round-off-based method to convert the continuous variable nch,c to a discrete one leads to a high number of oscillations in the solution. This corresponds to frequent turning on and off of one or more chillers, which is detrimental to them.

Step 2 converts the continuous solution from step 1 to a discrete signal, and itself involves multiple steps. The first step is Algorithm 2, which filters the signal nch,c with a modified moving average filter with a two-hour window (corresponding to 12 samples with a 10 min sampling period) and then rounds the filtered value up to the nearest integer. Thus, by operating the moving average filter on nch,c, one obtains a discrete signal for the chiller command nch,f = moving_average_round(nch,c).

Algorithm 2

Input: Signal x of length n, w ∈ Z+ (window length)

 for i = 1 : w/2

  xd[i] = ceil(mean(x[1 : i+w/2]))

 end

 for i = w/2+1 : n−w/2

  xd[i] = ceil(mean(x[i−w/2 : i+w/2]))

 end

 for i = n−w/2+1 : n

  xd[i] = ceil(mean(x[i−w/2 : end]))

 end

Output: Discrete signal xd.

The rounding moving average filter typically does not reduce the switching frequency sufficiently. This is why an additional step, Algorithm 3, described below, is used to operate on this signal and produce the output nch,d:=reduce_switching(nch,f) that has fewer switches.

xrs = reduce_switching(x)

Algorithm 3

Input: Discrete signal x ∈ Zn and w ∈ Z+ (window length)

 1: Obtain indices of the entries of x that are not to be changed, index_freezed, as follows:

  Initialize index_freezed = zeros(n,1) # Array of dimension n with all entries zero

  for i=1:n

  if Nswitch(x[i−w : i]) = 0

    index_freezed[i] ← 1

   end

 2: Initialize xrs: xrs[i]=x[i] for each i such that index_freezed[i] = 1.

 3: For each i in index_freezed which is 0:

   Find all the consecutive 0 entries till the next 1. Let these indices be Isi, and define yi=mean(x[Isi]).

   Set xrs[j] ← yi for every j ∈ Isi.

   Set index_freezed[Isi] ← 1

end

Output:xrs.
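For reference, a possible Python rendering of the step-2 post-processing is given below: n_switches counts switches as defined above, moving_average_round follows Algorithm 2 (window mean, rounded up, with truncated windows at the edges), and reduce_switching follows the structure of Algorithm 3. Edge-case handling reflects our best-effort reading of the pseudocode, not the authors' code.

import math
import numpy as np

def n_switches(x):
    x = np.asarray(x)
    return int(np.sum(x[:-1] != x[1:]))

def moving_average_round(x, w=12):
    x = np.asarray(x, dtype=float)
    n, h = len(x), w // 2
    out = np.empty(n, dtype=int)
    for i in range(n):
        lo, hi = max(0, i - h), min(n, i + h + 1)   # window truncated at the boundaries
        out[i] = math.ceil(np.mean(x[lo:hi]))       # round the window mean up
    return out

def reduce_switching(x, w=12):
    x = np.asarray(x, dtype=int)
    n = len(x)
    # An entry is "frozen" if there is no switch in the trailing window ending at it.
    frozen = np.array([n_switches(x[max(0, i - w):i + 1]) == 0 for i in range(n)])
    out = x.copy()
    i = 0
    while i < n:
        if frozen[i]:
            i += 1
            continue
        j = i
        while j < n and not frozen[j]:              # run of consecutive non-frozen entries
            j += 1
        out[i:j] = int(round(np.mean(x[i:j])))      # replace the run by its (rounded) mean
        i = j
    return out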

The need for step 3 arises because the chiller command {nch,d} obtained at the end of step 2, together with the other variables in the solution vector from step 1, may violate some constraints of the optimization problem (33). Even if {xk+1p} and {nch,d} are feasible, the resulting control commands may not track the cooling load adequately. Step 3 ensures a feasible solution and improves tracking.

Forecasts: Implementation requires the availability of forecasts of the disturbance wkp, i.e., the cooling load reference and the electricity price, over the next planning horizon. There is a large literature on estimating and/or forecasting loads for buildings and real-time electricity prices; see Refs. [45–47] and references therein. The forecast of Tkoawb is available from the National Weather Service [48]. We therefore assume the forecasts of the three disturbance signals, q˙kL,ref, Tkoawb, and ρk, are available to the MPC controller at each k.

5 Rule-Based Baseline Controller

In order to evaluate the performances of the RL and MPC controllers, we compare them to a rule-based baseline (BL) controller. The proposed baseline controller is designed to utilize the TES and time-varying electricity prices (to the extent possible with heuristics) to reduce energy costs. The RL controller and baseline controller have the same information about the price: the current price ρk and a backward moving average ρ̄k. At each time-step k, the baseline controller determines the control command uk = [m˙klw, m˙ktw, nkch, m˙kcw, m˙koa]T following the procedure shown in Fig. 4. The flowcharts are elaborated in Ref. [34] and briefly explained in Secs. 5.1 and 5.2. The subscript "sat" indicates that a variable is saturated at its upper or lower bound; the numerical values of the bounds used in simulations are shown in Table 1. For estimating the outputs under nominal conditions and the time-dependent bounds, please refer to Ref. [34].

Fig. 4 Baseline controller
Table 1

Simulation parameters

Para | Unit | Value
ts | min | 10
tsTplan/60 | h | 24
τts/60 | h | 4
wts/60 | h | 2
nmaxch | N/A | 7
m˙max/mintw | kg/s | 30/−30
m˙max/minlw | kg/s | 350/20
m˙max/mincw | kg/s | 300/20
Smaxtwc/tww | N/A | 0.95
Smintwc/tww | N/A | 0.05
m˙indv | kg/s | 50
q˙indvch | kW | 1046
Tmax/minlw,r | °C | 15/5
Tmax/minchw,s | °C | 10/5
Tmax/mincw,r | °C | 40/25
m˙max/minoa | kg/s | 1.25/1
Tsetcw,s | °C | 29
Tsetchw,s | °C | 7

5.1 For Chilled Water Loop.

  1. At time-step k, nkch, m˙klw, and m˙ktw are initialized to nk1ch, m˙k1lw, and m˙k1tw.

  2. The BL controller increases or decreases m˙ktw by a fixed amount (10 kg/s) if ρk is 5% lower or higher than ρ̄k, respectively, in order to take advantage of the time-varying electricity price (a sketch of this rule is given after this list).

  3. The BL controller estimates Tk+1lw,r, Tk+1chw,s, Sk+1twc under the assumption of m˙bp=0 and q˙kch=nkchq˙indvch. If Tk+1lw,r, Tk+1chw,s, Sk+1twc are within their bounds, the current control command for the chilled water loop is executed. Otherwise, the controller repeatedly increases/decreases m˙klw and m˙ktw by a fixed amount (10 kg/s), and nkch by 1 until Tk+1lw, r, Tk+1chw,s, and Sk+1twc are within their bounds. Since m˙klw+m˙ktw determines the minimum required nkch, the final nkch is readjusted to meet the minimum required nkch.
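A minimal sketch of the price-based TES rule in step 2 above is given below, assuming the ±30 kg/s TES flowrate bounds of Table 1 and the sign convention that m˙tw > 0 means charging.

def tes_price_rule(m_tw_prev, rho_k, rho_bar, step=10.0, m_tw_min=-30.0, m_tw_max=30.0):
    if rho_k < 0.95 * rho_bar:
        m_tw = m_tw_prev + step        # electricity is cheap: charge the TES more
    elif rho_k > 1.05 * rho_bar:
        m_tw = m_tw_prev - step        # electricity is expensive: discharge more
    else:
        m_tw = m_tw_prev               # price close to its moving average: no change
    return min(max(m_tw, m_tw_min), m_tw_max)   # saturate at the Table 1 bounds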

5.2 For Cooling Water Loop.

  1. m˙kcw and m˙koa are initialized to m˙k1cw and m˙k1oa.

  2. The BL controller estimates Tk+1cw,r by assuming a fixed fraction of electric power consumed by chillers is added into the cooling water loop. This fraction is to be estimated from historical data. If Tk+1cw,r is above/below its bound, m˙kcw is increased/decreased by a fixed amount (20 kg/s) repeatedly until Tk+1cw,r is within its bound.

  3. Once m˙kcw is determined, the capacity of the cooling tower q˙UB,kct and the required cooling q˙set,kct that cools down Tkcw,r to Tsetcw,s are computed. If q˙set,kct ≤ q˙UB,kct ≤ 1.1q˙set,kct, then the current control command for the cooling water loop is executed. If q˙UB,kct < q˙set,kct or q˙UB,kct > 1.1q˙set,kct, m˙koa is increased or decreased by a fixed amount (0.05 kg/s). Since q˙UB,kct depends on the ambient wet-bulb temperature Tkoawb (illustrated in Ref. [34]), there can be a case in which q˙UB,kct cannot satisfy q˙set,kct ≤ q˙UB,kct ≤ 1.1q˙set,kct even when m˙koa is already at its bound. In this case, m˙kcw is varied by a fixed amount (20 kg/s) repeatedly until Tk+1cw,r and q˙UB,kct are within their bounds.

6 Performance Evaluation

6.1 Simulation Setup.

Simulations for closed-loop control with the RL, MPC, and baseline controllers are performed for the week of Sept. 6–12, 2021, which we refer to as the testing week in the sequel. The weather data for the testing week are obtained from the Singapore data set described in Sec. 2.3. The real-time electricity price used is a scaled version of PJM's locational marginal price for the same week [44]. Other relevant simulation parameters are listed in Table 1. There is no plant-model mismatch in the MPC simulations. In particular, since forecasts of the disturbance signals are available in practice (see the discussion at the end of Sec. 4), in the simulations the MPC controller is provided with error-free forecasts in the interest of simplicity.

We emphasize that the closed-loop results with the RL controller presented here are “out-of-sample” results, meaning the external disturbance wk (weather, cooling load, and electricity price) used in the closed-loop simulations are different from those used in training the RL controller.

Four performance metrics are used to compare the three controllers. The first is the energy cost incurred. The second is the root mean square error (RMSE) in tracking the cooling load reference
(34)
where Nsim is the duration for which closed-loop simulations are carried out, which in this paper is 1008 time-steps (corresponding to a week of 10-min steps: 7 × 24 × 6). The third is the number of chiller switches over the simulation period
(35)
Fast cycling decreases the life expectancy of a chiller greatly. The fourth is control computational time during closed-loop simulations.
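For concreteness, the sketch below computes the first three metrics from closed-loop trajectories; the energy cost is computed as price times energy per 10-min step, which is our reading of Eq. (12).

import numpy as np

def evaluate(rho, P_tot_kW, q_L, q_L_ref, n_ch, ts_hours=10.0 / 60.0):
    rho, P, qL, qref, nch = map(np.asarray, (rho, P_tot_kW, q_L, q_L_ref, n_ch))
    cost = float(np.sum(rho * P * ts_hours))                 # total electricity cost (USD)
    e_rmse = float(np.sqrt(np.mean((qL - qref) ** 2)))       # tracking RMSE, Eq. (34)
    switches = int(np.sum(nch[:-1] != nch[1:]))              # chiller switches, Eq. (35)
    return cost, e_rmse, switches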

6.2 Numerical Results and Discussion.

A summary of performance comparisons from the simulations is shown in Table 2. All three controllers meet the cooling load adequately (more on this later), and both the RL and MPC controllers reduce energy cost over the baseline by about the same amount (16.8% for RL versus 17.8% for MPC). These savings are comparable with those reported in the literature for MPC with MILP relaxation and RL.

Table 2

Comparison of RL, MPC, and baseline controllers (for a week-long simulation)

Controller | Total cost ($) | eRMSE (kW) | No. of switches | Control computation time (s, μ ± σ)
Baseline | 3308 | 4.14 × 10−4 | 45 | 8.9 × 10−5 ± 3.9 × 10−4
RL | 2752 | 1.85 | 114 | 0.32 ± 0.01
MPC | 2719 | 61.38 | 65 | 27.33 ± 5.99

In terms of tracking the reference load, both RL and MPC again perform similarly while the baseline controller performs the best in terms of the standard deviation of tracking error; see Fig. 5 and Table 2. The worst tracking RMSE is 61 kW, which is a small fraction of the mean load (1313 kW). Thus the tracking performance is considered acceptable for all three controllers. The fact that the baseline performs the best in tracking the cooling load is perhaps not surprising since it is designed primarily to meet the required load and keep chiller switching low, with energy cost a secondary consideration.

Fig. 5 Load tracking performances of the MPC, RL, and BL controllers: "Ref" is the required cooling q˙kL,ref

In terms of chiller switches, the RL controller performs the worst; see Table 2. This is not surprising because no cost was assigned to higher switching in its design. The MPC performs the best in this metric, again most likely since keeping switching frequency low was an explicit consideration in its design. Ironically, this feature was introduced into the MPC controller after an initial design attempt without it, which led to a high switching frequency.

In terms of real-time computation cost, the baseline performs the best, which is not surprising since no optimization is involved. The RL controller has two orders of magnitude lower computation cost compared to MPC. The computation time for all controllers is well within the time budget since control commands are updated every 10 min.

Deeper look: Simulations are done for a week, but the plots below show only 2 days to avoid clutter. The cost savings of the RL and MPC controllers come from their ability to use the TES to shift the peak electric demand to periods of low price better than the baseline controller does; see Fig. 6. The MPC controller has knowledge of the electricity price over the whole planning horizon, and thus achieves the most savings. The cause of the cost-saving difference between the BL and RL controllers is that the RL controller learns the variation in the electricity price well, or at least exploits it better than the BL controller. This can be seen in Fig. 7. The RL controller always discharges the TES (Stwc drops) during the peak electricity price, while the baseline controller sometimes cannot do so because the volume of cold water is already at its minimum bound. The BL controller discharges the TES as soon as the electricity price rises, which may result in insufficient cold water stored in the TES when the electricity price reaches its maximum. While both the RL and BL controllers are forced to use the same price information (the current price and a backward moving average), the rule-based logic in the baseline controller cannot use that information as effectively as RL.

Fig. 6 Power consumption versus real-time electricity price for the MPC, RL, and BL controllers
Fig. 7 TES cold water volume versus real-time electricity price for the MPC, RL, and BL controllers

An alternate view of this behavior can be obtained by looking at the times when the chillers are turned on and off, since using chillers costs much more electricity than using the TES, which only needs a few pumps. We can see from Fig. 8 that all controllers shift their peak electricity demand to the times when electricity is cheap. But the rule-based logic of the BL controller is not able to line up electric demand with the low price as well as the RL and MPC controllers do.

Fig. 8 Required cooling load versus real-time electricity price for the MPC, RL, and BL controllers

Another benefit of the RL controller is that it typically activates fewer chillers than the BL controller, even though the number of active chillers is not explicitly penalized in its cost function; see Fig. 9. This behavior may increase the life expectancy of the DCEP.

Fig. 9 Number of active chillers versus real-time electricity price for the MPC, RL, and BL controllers

7 Under the Hood of the Reinforcement Learning Controller

More insights about why the learned policy works under various conditions can be obtained by taking a closer look at the design choices made for the RL controller. All these choices were the result of considerable trial-and-error.

Choice of basis functions: The choice of basis to approximate the Q-function is essential to the success of the RL controller. It defines the approximate Q-function, and consequently the policy (31). Redundant basis functions can lead to overfitting, which causes poor out-of-sample performance of the policy. We avoid this effect by selecting a reduced quadratic basis, corresponding to the 36 unique non-zero entries shown in Fig. 10. Another advantage of reducing the number of basis functions is that it reduces the number of parameters to learn, as the training effort increases dramatically with the number of parameters.

Fig. 10 Sparsity pattern of the matrix Pθ appearing in Eq. (29)

The choices for the basis were based on physical intuition about the DCEP. First, basis functions can be simplified by dropping redundant states. One example is Stww: since Stwc and Stww are complementary (Stwc + Stww = 1), one of them can be dropped, and because Stwc reflects the amount of cooling stored in the TES, we dropped Stww. Another example is the term Ttww, which is dropped since it is bounded by Tlw,r, which is already included in the basis. Second, if two terms have a strong causal or dependent relationship, e.g., m˙lw and Tlw,r, then the corresponding quadratic term m˙lwTlw,r should be selected as an element of the basis. Third, if two terms have minimal causal or dependent relationship, e.g., m˙oa and m˙tw (they are from different equipment and water loops), then the corresponding quadratic term m˙oam˙tw should not be selected as an element of the basis.

Choice of states: Exogenous disturbances have to be included in the RL state to make the controller work under cooling load, electricity price, and weather trajectories that are distinct from what is seen during training. Without this feature, the RL controller would not be applicable in the field.

Convergence of the learning algorithm: The learning algorithm appears to converge in training, meaning |θk − θk−1| is seen to decrease as the number of training epochs k increases; see Fig. 11. This convergence should not be confused with convergence to any meaningful optimal policy: the policy learned in the 40th iteration can be a better-performing controller than the policy obtained in the 50th iteration. We believe the proximal-gradient-type method used in learning helps the parameters not diverge, but for the same reason it may prevent the parameters from converging to a faraway optimum. This trade-off is necessary: our initial attempts without the damping proximal term were not successful in learning anything useful. As a result, after a few policy improvement iterations, every new policy obtained had to be tested by running a closed-loop simulation to assess its performance. The best performing one was selected as "the RL controller," which happened to be the 26th one.

Fig. 11 Values of θ versus policy iteration index. Only five θ's are shown to avoid clutter.

Numerical considerations for training: Training the RL controller is an iterative task that requires trying many different configurations of the parameters appearing in Table 1. In particular, we found the following considerations useful:

  1. If the value of κ is too small, the controller will not learn to track the load q˙kL,ref. On the other hand, if κ is too large, the controller will not save energy cost. The value of κ given in Sec. 3.3.5 was determined by trial-and-error.

  2. The condition number of Eq. (23) significantly affects the performance of Algorithm 1. However, the relative magnitudes of the state and input values are very different, for example, q˙L ∈ [300, 4000] kW and Stwc ∈ [0.05, 0.95], which makes the condition number of Eq. (23) extremely large. Therefore, we normalize the states and inputs by their average values (a small sketch of this scaling follows this list). With appropriate scaling, we reduced the condition number from about 1 × 10^20 to 1 × 10^3.
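A small sketch of this scaling, assuming each state/input channel is divided by its average magnitude computed from the training data:

import numpy as np

def normalize(Z, channel_means=None):
    # Z: array of stacked state/input samples, one row per time-step.
    Z = np.asarray(Z, dtype=float)
    if channel_means is None:
        channel_means = np.mean(np.abs(Z), axis=0)   # average magnitude per channel
    return Z / channel_means, channel_means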

8 Conclusion

The proposed MPC and RL controllers reduce energy cost significantly, by about 17% in a week-long simulation, relative to the rule-based baseline controller. Apart from the dramatically lower real-time computational cost of the RL controller compared to MPC, the load-tracking and cost-saving performance of the two controllers is similar. This similarity is somewhat surprising: although both controllers are designed as approximations of the same intractable infinite-horizon problem, there are significant differences between them, especially in the information they have access to and the objectives they are designed to minimize. It should be noted that the MPC controller has a crucial advantage over the RL controller in our simulations: the RL controller has to implicitly learn to forecast disturbances, whereas the MPC controller is provided with error-free forecasts. How much MPC's performance will degrade in practice due to the inevitable plant-model mismatch is an open question.

Existing work on RL and on MPC tends to lie in separate silos, and comparisons between the two for the same application are rare. This paper contributes such a comparison for a particular application: control of DCEPs. Much more remains to be done, such as an examination of robustness to uncertainties.

There are several other avenues for future work. One is to explore nonlinear bases, such as neural networks, for designing the RL controller. Another is to augment the state space with additional signals, especially forecasts, which might improve performance; of course, such augmentation will also increase the cost and complexity of training the policy. A further avenue for improving the RL controller is to reduce the number of chiller on/off switches. In this paper, all chillers are assumed to be identical; extending both the RL and MPC formulations to heterogeneous chillers with distinct performance curves is another area of improvement. On the MPC front, an MILP relaxation is a direction to pursue in the future.

Acknowledgement

The research reported here has been partially supported by the NSF through awards 1934322 (CMMI) and 2122313 (ECCS).

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The datasets generated for and supporting the findings of this article are available from the corresponding author upon reasonable request.
