{"title": "Reinforcement Learning by Probability Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 1080, "page_last": 1086, "abstract": null, "full_text": "Reinforcement Learning by Probability \n\nMatching \n\nPhilip N. Sabes \n\nMichael I. Jordan \n\nsabes~psyche.mit.edu \n\njordan~psyche.mit.edu \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nWe present a new algorithm for associative reinforcement learn(cid:173)\ning. The algorithm is based upon the idea of matching a network's \noutput probability with a probability distribution derived from the \nenvironment's reward signal. This Probability Matching algorithm \nis shown to perform faster and be less susceptible to local minima \nthan previously existing algorithms. We use Probability Match(cid:173)\ning to train mixture of experts networks, an architecture for which \nother reinforcement learning rules fail to converge reliably on even \nsimple problems. This architecture is particularly well suited for \nour algorithm as it can compute arbitrarily complex functions yet \ncalculation of the output probability is simple. \n\n1 \n\nINTRODUCTION \n\nThe problem of learning associative networks from scalar reinforcement signals is \nnotoriously difficult . Although general purpose algorithms such as REINFORCE \n(Williams, 1992) and Generalized Learning Automata (Phansalkar, 1991) exist, they \nare generally slow and have trouble with local minima. As an example, when we \nattempted to apply these algorithms to mixture of experts networks (Jacobs et al. , \n1991), the algorithms typically converged to the local minimum which places the \nentire burden of the task on one expert. \n\nHere we present a new reinforcement learning algorithm which has faster and more \nreliable convergence properties than previous algorithms. 
The next section describes the algorithm and draws comparisons between it and existing algorithms. The following section details its application to Gaussian units and mixtures of Gaussian experts. Finally, we present empirical results. \n\n2 REINFORCEMENT PROBABILITY MATCHING \n\nWe begin by formalizing the learning problem. Given an input x ∈ X from the environment, the network must select an output y ∈ Y. The network then receives a scalar reward signal r, with a mean r̄ and distribution that depend on x and y. The goal of the learner is to choose an output which maximizes the expected reward. Due to the lack of an explicit error signal, the learner must choose its output stochastically, exploring for better rewards. Typically the learner starts with a parameterized form for the conditional output density p_θ(y|x), and the learning problem becomes one of finding the parameters θ which maximize the expected reward: \n\nJ_r(θ) = ∫_{X,Y} p(x) p_θ(y|x) r̄(x,y) dy dx. \n\nWe present an alternative route to the maximum expected reward cost function, and in doing so derive a novel learning rule for updating the network's parameters. The learner's task is to choose from a set of conditional output distributions based on the reward it receives from the environment. These rewards can be thought of as inverse energies; input/output pairs that receive high rewards are low energy and are preferred by the environment. Energies can always be converted into probabilities through the Boltzmann distribution, and so we can define the environment's conditional distribution on Y given x, \n\np*(y|x) = exp(−T⁻¹ E(x,y)) / Z_T(x) = exp(T⁻¹ r̄(x,y)) / Z_T(x), \n\nwhere T is a temperature parameter and Z_T(x) is a normalization constant which depends on T. 
This distribution can be thought of as representing the environment's ideal input-output mapping, high reward input-output pairs being more typical or likely than low reward pairs. The temperature parameter determines the strength of this preference: when T is infinity all outputs are equally likely; when T is zero only the highest reward output is chosen. This new distribution is a purely theoretical construct, but it can be used as a target distribution for the learner. If the θ are adjusted so that p_θ(y|x) is nearly equal to p*(y|x), then the network's output will typically result in high rewards. \n\nThe agreement between the network and environment conditional output densities can be measured with the Kullback-Leibler (KL) divergence: \n\nKL(p_θ ∥ p*) = −∫_{X,Y} p(x) p_θ(y|x) [log p*(y|x) − log p_θ(y|x)] dy dx \n= −(1/T) ∫_{X,Y} p(x) p_θ(y|x) [r̄(x,y) − T r_θ(x,y)] dy dx + ∫_X p(x) log Z_T(x) dx,   (1) \n\nwhere r_θ(x,y) is defined as the logarithm of the conditional output probability and can be thought of as the network's estimate of the mean reward. This cost function is always greater than or equal to zero, with equality only when the two probability distributions are identical. \n\nKeeping only the part of Equation 1 which depends on θ, we define the Probability Matching (PM) cost function: \n\nJ_PM(θ) = −∫_{X,Y} p(x) p_θ(y|x) [r̄(x,y) − T r_θ(x,y)] dy dx = −J_r(θ) − T S(p_θ). \n\nThe PM cost function is analogous to a free energy, balancing the energy, in the form of the negative of the average reward, and the entropy S(p_θ) of the output distribution. \n\nFigure 1: p*'s (dashed) and PM optimal Gaussians (solid) for the same bimodal reward function and various temperatures. Note the differences in scale. 
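As a rough numerical illustration (ours, not from the paper) of how the Boltzmann target p* sharpens as T is lowered, the effect that Figure 1 depicts, one can discretize the output space and normalize exp(r/T) directly; the bimodal reward function below is an arbitrary assumption chosen only for illustration: \n

```python
import math

def target_density(rewards, T):
    # Discretized Boltzmann target: p*(y) proportional to exp(r(y)/T).
    m = max(rewards)
    weights = [math.exp((r - m) / T) for r in rewards]  # subtract max for stability
    Z = sum(weights)  # plays the role of the normalizer Z_T(x)
    return [w / Z for w in weights]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical bimodal mean-reward function on a grid of outputs y in [-1, 1],
# with the higher of the two bumps at y = 0.5.
ys = [i / 100.0 for i in range(-100, 101)]
rewards = [math.exp(-(y + 0.5) ** 2 / 0.02) + 2 * math.exp(-(y - 0.5) ** 2 / 0.02)
           for y in ys]

for T in (1.0, 0.5, 0.2, 0.05):
    p = target_density(rewards, T)
    mode = ys[p.index(max(p))]
    print(T, round(entropy(p), 3), mode)
```

As T falls, the entropy of p* drops and nearly all of the mass concentrates at the highest-reward output, mirroring the spike solution described in the text. \n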
A higher T corresponds to a smoother target distribution and tilts the balance of the cost function in favor of the entropy term, making diffuse output distributions more favorable. Likewise, a small T results in a sharp target distribution placing most of the weight on the reward dependent term of the cost function, which is always optimized by the singular solution of a spike at the highest reward output. \n\nAlthough minimizing the PM cost function will result in sampling most often at high reward outputs, it will not optimize the overall expected reward if T > 0. There are two reasons for this. First, the output y which maximizes r_θ(x,y) may not maximize r̄(x,y). Such an example is seen in the first panel of Figure 1: the network's conditional output density is a Gaussian with adjustable mean and variance, and the environment has a bimodal reward function and T = 1. Even in the realizable case, however, the network will choose outputs which are suboptimal with respect to its own predicted reward, with the probability of choosing output y falling off exponentially with r_θ(x,y). The key point here is that early in learning this non-optimality is exactly what is desired. The PM cost function forces the learner to maintain output density everywhere the reward, as measured by p*, is not much smaller than its maximum. When T is high, the rewards are effectively flattened and even fairly small rewards look big. This means that a high temperature ensures that the learner will explore the output space. \n\nOnce the network is nearly PM optimal, it would be advantageous to \"sharpen up\" the conditional output distribution, sampling more often at outputs with higher predicted rewards. This translates to decreasing the entropy of the output distribution or lowering T. 
Figure 1 shows how the PM optimal Gaussian changes as the temperature is lowered in the example discussed above; at very low temperatures the output is almost always near the mode of the target distribution. In the limit of T = 0, J_PM becomes the original reward maximization criterion J_r. The idea of the Probability Matching algorithm is to begin training with a large T, say unity, and gradually decrease it as the performance improves, effectively shifting the bias of the learner from exploration to exploitation. \n\nWe now must find an update rule for θ which minimizes J_PM(θ). We proceed by looking for a stochastic gradient descent step. Differentiating the cost function gives \n\n∇_θ J_PM(θ) = −∫_{X,Y} p(x) p_θ(y|x) [r̄(x,y) − T r_θ(x,y)] ∇_θ r_θ(x,y) dy dx. \n\nThus, if after every action the parameters are updated by the step \n\nΔθ = α [r − T r_θ(x,y)] ∇_θ r_θ(x,y),   (2) \n\nwhere α is a constant which can vary over time, then the parameters will on average move down the gradient of the PM cost function. Note that any quantity which does not depend on y or r can be added to the difference in the update rule, and the expected step will still point along the direction of the gradient. \n\nThe form of Equation 2 is similar to the REINFORCE algorithm (Williams, 1992), whose update rule is \n\nΔθ = α (r − b) ∇_θ log p_θ(y|x), \n\nwhere b, the reinforcement baseline, is a quantity which does not depend on y or r. Note that these two update rules are identical when T is zero.¹ The advantage of the PM rule is that it allows for an early training phase which encourages exploration without forcing the output distribution to converge on suboptimal outputs. This will lead to striking qualitative differences in the performance of the algorithm for training mixtures of Gaussian experts. 
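Equation 2 can be made concrete with a small sketch, ours rather than the paper's code, for a one-dimensional Gaussian unit; sigma is handled through its logarithm (as done in the next section), and setting T = 0 recovers the REINFORCE step with baseline b = 0: \n

```python
import math

def log_density(y, mu, sigma):
    # r_theta(x, y) = log p_theta(y|x) for a scalar Gaussian.
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def pm_increments(w, log_sigma, x, y, r, alpha, T):
    # One Probability Matching step (Equation 2) for a scalar Gaussian unit
    # with mean mu = w * x; sigma = exp(log_sigma) keeps sigma positive.
    sigma = math.exp(log_sigma)
    mu = w * x
    r_theta = log_density(y, mu, sigma)        # the network's predicted reward
    scale = alpha * (r - T * r_theta)          # T = 0 gives REINFORCE with b = 0
    dw = scale * (y - mu) * x / sigma ** 2     # gradient of r_theta w.r.t. w
    dlog_sigma = scale * (((y - mu) / sigma) ** 2 - 1.0)  # gradient w.r.t. log sigma
    return dw, dlog_sigma

# At T = 0 the step is exactly alpha * r * grad log p_theta(y|x):
dw, dls = pm_increments(w=0.3, log_sigma=0.0, x=1.0, y=0.8, r=-0.5, alpha=0.1, T=0.0)
print(dw, dls)  # dw = 0.1 * (-0.5) * (0.8 - 0.3) = -0.025
```

With T > 0 the same sampled action produces a different step, because the predicted-reward term r_theta shifts the effective reinforcement; this is the exploration-preserving behavior discussed above. \n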
\n\n3 UPDATE RULES FOR GAUSSIAN UNITS AND MIXTURES OF GAUSSIAN EXPERTS \n\nWe employ Gaussian units with mean μ = wᵀx and covariance σ²I. The learner must select the matrix w and scalar σ which minimize J_PM(w, σ). Applying the update rule in Equation 2, we get \n\nΔw = α [r − T r_θ(x,y)] (1/σ²)(y − μ)ᵀx \nΔσ = α [r − T r_θ(x,y)] (1/σ)(‖y − μ‖²/σ² − 1). \n\nIn practice, for both single Gaussian units and the mixtures presented below we avoid the issue of constraining σ > 0 by updating log σ directly. \n\nWe can generalize the linear model by considering a conditional output distribution in the form of a mixture of Gaussian experts (Jacobs et al., 1991), \n\np(y|x) = Σ_{i=1}^N g_i(x) (2πσ_i²)^{−d/2} exp(−‖y − μ_i‖²/(2σ_i²)). \n\nExpert i has mean μ_i = w_iᵀx and covariance σ_i²I. The prior probability given x of choosing expert i, g_i(x), is determined by a single layer gating network with weight matrix v and softmax output units. The gating network learns a soft partitioning of the input space into regions for which each expert is responsible. \n\nAgain, we can apply Equation 2 to get the PM update rules: \n\nΔv_i = α [r − T r_θ(x,y)] (h_i − g_i)x \nΔw_i = α [r − T r_θ(x,y)] h_i (1/σ_i²)(y − μ_i)ᵀx \nΔσ_i = α [r − T r_θ(x,y)] h_i (1/σ_i)(‖y − μ_i‖²/σ_i² − 1), \n\nwhere h_i = g_i p_i(y|x)/p(y|x) is the posterior probability of choosing expert i given y. We note that the PM update rules are equivalent to the supervised learning gradient descent update rules in (Jacobs et al., 1991) modulated by the difference between the actual and expected rewards. \n\n¹This fact implies that the REINFORCE step is in the direction of the gradient of J_r(θ), as shown by (Williams, 1992). See Williams and Peng, 1991, for a similar REINFORCE plus entropy update rule. \n\nTable 1: Convergence times and gate entropies for the linear example (standard errors in parentheses). 
Convergence times: An experiment consisting of 50 runs was conducted for each algorithm, with a wide range of learning rates and both reward functions. Best results for each algorithm are reported. Entropy: Values are averages over the last 5,000 time steps of each run. 20 runs of 50,000 time steps were conducted. \n\nAlgorithm   | Convergence Time | Entropy \nPM, T = 1   | 1088 (43)        | .993 (.0011) \nPM, T = .5  | -                | .97 (.02) \nPM, T = .1  | -                | .48 (.04) \nREINFORCE   | 2998 (183)       | .21 (.03) \nREINF-COMP  | 1622 (46)        | .21 (.03) \n\nBoth the h_i and r_θ depend on the overall conditional probability p(y|x), which in turn depends on each p_i(y|x). This adds an extra step to the training procedure. After receiving the input x, the network chooses an expert based on the priors g_i(x) and an output y from the selected expert's output distribution. The output is then sent back to each of the experts in order to compute the likelihood of their having generated it. Given the set of p_i's, the network can update its parameters as above. \n\n4 SIMULATIONS \n\nWe present three examples designed to explore the behavior of the Probability Matching algorithm. In each case, networks were trained using Probability Matching, REINFORCE, and REINFORCE with reinforcement comparison (REINF-COMP), where a running average of the reward is used as a reinforcement baseline (Sutton, 1984). In the first two examples an optimal output function y*(x) was chosen and used to calculate a noisy error, ε = ‖y − y*(x) − z‖, where z was i.i.d. zero-mean Gaussian with σ = .1. The error signal determined the reward by one of two functions, r = −ε²/2 or exp(−ε²/2). When the RMSE between the network mean and the optimal output was less than .05 the network was said to have converged. \n\n4.1 A Linear Example \n\nIn this example x was chosen uniformly from [−1,1]² × {1}, and the optimal output was y* = Ax, for a 2 × 3 matrix A. 
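A sketch of how the Section 3 mixture updates would be applied on each step of such an experiment (scalar inputs and outputs for brevity; this is our illustrative reimplementation, not the authors' simulation code): \n

```python
import math

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def gauss_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pm_mixture_increments(v, w, log_sig, x, y, r, alpha, T):
    # PM increments for a mixture of scalar Gaussian experts with softmax gate
    # priors g_i(x) and expert means mu_i = w_i * x; sigmas held as logs.
    n = len(w)
    g = softmax([vi * x for vi in v])                  # gating priors g_i(x)
    sig = [math.exp(ls) for ls in log_sig]
    mu = [wi * x for wi in w]
    p_i = [gauss_pdf(y, mu[i], sig[i]) for i in range(n)]
    p_y = sum(g[i] * p_i[i] for i in range(n))         # mixture density p(y|x)
    h = [g[i] * p_i[i] / p_y for i in range(n)]        # posteriors h_i
    c = alpha * (r - T * math.log(p_y))                # common factor r - T*r_theta
    dv = [c * (h[i] - g[i]) * x for i in range(n)]
    dw = [c * h[i] * (y - mu[i]) * x / sig[i] ** 2 for i in range(n)]
    dls = [c * h[i] * (((y - mu[i]) / sig[i]) ** 2 - 1.0) for i in range(n)]
    return dv, dw, dls, h
```

Because the gate is a softmax, the posteriors h_i and priors g_i each sum to one, so the gating increments sum to zero: the gate only shifts probability between experts rather than creating or destroying it. \n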
A mixture of three Gaussian experts was trained. The details of the simulation and results for each algorithm are shown in Table 1. Probability Matching with constant T = 1 shows almost a threefold reduction in training time compared to REINFORCE and about a 50% improvement over REINF-COMP. \n\nThe important point of this example is the manner in which the extra Gaussian units were employed. We calculated the entropy of the gating network, normalized so that a value of one means that each expert has equal probability of being chosen and a value of zero means that only one expert is ever chosen. The values after 50,000 time steps are shown in the Entropy column of Table 1. When T ≈ 1, the Probability Matching algorithm gives the three experts roughly equal priors. This is due to the fact that small differences in the experts' parameters lead to increased output entropy if all experts are used. REINFORCE on the other hand always converges to a solution which employs only one expert. This difference in the behavior of the algorithms will have a large qualitative effect in the next example. \n\nTable 2: Results for absolute value. The percentage of trials that converged and the average time to convergence for those trials. Standard errors are in parentheses. 50 trials were conducted for a range of learning rates and with both reward functions; the best results for each algorithm are shown. \n\nAlgorithm   | Successful Trials | Convergence Time \nPM          | 100%              | 6,052 (313) \nREINFORCE   | 48%               | 76,775 (3,329) \nREINF-COMP  | 38%               | 42,105 (3,869) \n\nFigure 2: Example 4.3. The environment's probability distribution for T = 1: (a) density plot of p* vs. 
y/x, (b) cross-sectional view with y₂ = 0. Locally weighted mean and variance of y₂/x over representative runs: (c) T = 1, (d) T = 0 (i.e. REINFORCE). \n\n4.2 Absolute Value \n\nWe used a mixture of two Gaussian units to learn the absolute value function. The details of the simulation and the best results for each algorithm are shown in Table 2. Probability Matching with constant T = 1 converged to criterion on every trial, in marked contrast to the REINFORCE algorithm. With no reinforcement baseline, REINFORCE converged to criterion in only about half of the cases, less with reinforcement comparison. In almost all of the trials that didn't converge, only one expert was active on the domain of the input. Neither version of REINFORCE ever converged to criterion in less than 14,000 time steps. \n\nThis example highlights the advantage of the Probability Matching algorithm. During training, all three algorithms initially use both experts to capture the overall mean of the data. REINFORCE converges on this local minimum, cutting one expert off before it has a chance to explore the rest of the parameter space. The Probability Matching algorithm keeps both experts in use. Here, the more conservative approach leads to a stark improvement in performance. \n\n4.3 An Example with Many Local Maxima \n\nIn this example, the learner's conditional output distribution was a bivariate Gaussian with μ = [w₁, w₂]ᵀx, and the environment's rewards were a function of y/x. The optimal output distribution p*(y|x) is shown in Figures 2(a,b). These figures can also be interpreted as the expected value of p* for a given w. The weight vector is initially chosen from a uniform distribution over [−.2, .2]², depicted as the very small white dot in Figure 2(a). There is a series of larger and larger local maxima off to the right, with a peak of height 2ⁿ at w₁ = 2ⁿ. \n\nThe results are shown in Table 3. 
REINFORCE, both with and without reinforcement comparison, never got past the third peak; the variance of the Gaussian unit would very quickly close down to a small value making further exploration of the output space impossible. Probability Matching, on the other hand, was able to find greater and greater maxima, with the variance growing adaptively to match the local scale of the reward function. These differences can be clearly seen in Figures 2(c,d), which show typical behavior for the Probability Matching algorithm with T = 1 and T = 0. \n\nTable 3: Results for Example 4.3. These values represent 20 runs for 50,000 time steps each. The first and second columns correspond to the number of the peak the learner reached. \n\nAlgorithm   | Mean Final log₂ w₁ | Range of Final log₂ w₁'s | Mean Final σ \nPM, T = 2   | 28.8               | [19.1, 51.0]             | > 10⁵ \nPM, T = 1   | 6.34               | [5.09, 8.08]             | 13.1 \nPM, T = .5  | 3.06               | [3.04, 3.07]             | .40 \nREINFORCE   | 2.17               | [2.00, 2.90]             | .019 \nREINF-COMP  | 2.05               | [2.05, 2.06]             | .18 \n\n5 CONCLUSION \n\nWe have presented a new reinforcement learning algorithm for associative networks which converges faster and more reliably than existing algorithms. The strength of the Probability Matching algorithm is that it allows for a better balance between exploration of the output space and exploitation of good outputs. The parameter T can be adjusted during learning to allow broader output distributions early in training and then to force the network to sharpen up its distribution once nearly optimal parameters have been found. \n\nAlthough the applications in this paper were restricted to networks with Gaussian units, the Probability Matching algorithm can be applied to any reinforcement learning task and any conditional output distribution. 
It could easily be employed, for example, on classification problems using logistic or multinomial (softmax) output units or mixtures of such units. Finally, the simulations presented in this paper are of simple examples. Preliminary results indicate that the advantages of the Probability Matching algorithm scale up to larger, more interesting problems. \n\nReferences \n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79-87. \n\nPhansalkar, V. V. (1991). Learning automata algorithms for connectionist systems - local and global convergence. PhD Thesis, Dept. of Electrical Engineering, Indian Institute of Science, Bangalore. \n\nSutton, R. S. (1984). Temporal credit assignment in reinforcement learning. PhD Thesis, Dept. of Computer and Information Science, University of Massachusetts, Amherst, MA. \n\nWilliams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256. \n\nWilliams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3:241-268. \n", "award": [], "sourceid": 1042, "authors": [{"given_name": "Philip", "family_name": "Sabes", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}