Reinforcement Learning and Concurrent Variable Interval Schedules

Final Project, 9.290
Melissa Warden

When presented with two concurrent variable interval reward schedules, rats visit the richer schedule more frequently, and the ratio of the time spent on each schedule approximates the ratio of the reward rates.  Gallistel, Mark, King, and Latham (2001) argue that well-trained rats learn to adjust to new schedules more quickly than can be explained by reinforcement learning.  The aim of this project is to show that fast switching can indeed be accounted for by reinforcement learning.

Reinforcement Learning
Reinforcement learning is the process of learning about stimuli on the basis of their associated rewards and punishments.  One form of reinforcement learning is operant, or instrumental, conditioning.  In this form of reinforcement learning an animal is rewarded for correctly performing an action.  Through trial and error it eventually learns to perform well on a given behavioral task by strengthening some of its responses and weakening others.

The Variable Interval Schedule
Once an animal has learned a behavior well by being rewarded on every correct action, it is possible to only reward the animal on some fraction of the correct trials and still maintain the behavior.  An interval schedule makes it impossible for the animal to receive a reward until a time delay has passed since the last reward.  A variable interval schedule draws these delays from an exponential distribution (or a geometric distribution in discrete time) with a parameter that sets the overall rate of reward.  When the delay has passed, the reward system is 'primed' and the next time the animal performs the correct action it will receive a reward. 
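
As a rough illustration of the priming mechanism just described, here is a minimal Python sketch of a single variable interval lever in discrete time.  It is not the project's Matlab code:  the class name, the parameter `prime_prob` (the per-second probability that a reward becomes available) and the helper function are assumptions made for this example.

```python
import random


class VariableIntervalLever:
    """Discrete-time variable interval schedule (one step = one second)."""

    def __init__(self, prime_prob, rng):
        self.prime_prob = prime_prob  # assumed per-step priming probability
        self.rng = rng
        self.primed = False

    def tick(self):
        """Advance one step; an unprimed lever primes with probability prime_prob."""
        if not self.primed and self.rng.random() < self.prime_prob:
            self.primed = True

    def press(self):
        """Return 1 and reset the lever if a reward was waiting, else return 0."""
        if self.primed:
            self.primed = False
            return 1
        return 0


def rewards_when_pressing_every_step(prime_prob, n_steps, seed=0):
    """Count rewards for an animal that presses this one lever every second."""
    rng = random.Random(seed)
    lever = VariableIntervalLever(prime_prob, rng)
    total = 0
    for _ in range(n_steps):
        lever.tick()
        total += lever.press()
    return total
```

Because the lever stays primed until it is pressed, the waiting time from one collected reward to the next availability is geometric, which is the discrete-time counterpart of the exponential delay.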

Concurrent Schedules and Herrnstein's Matching Law
If an animal has learned to correctly perform one variable interval schedule (for example, a rat that has learned how to press a lever that only gives rewards at a certain rate), it may be possible to train it on two simultaneously presented schedules; these are known as concurrent schedules.  In our case, a rat would be presented with two levers that give rewards according to variable interval schedules running at different rates.  When animals are actually subjected to this situation, the time that they spend at each lever is in proportion to the rate at which they are actually rewarded at that lever.  For example, if one lever is rewarded three times as frequently as the other, a rat would spend three times as much time there (Herrnstein, 1961).  This behavior has been shown to be close to optimal, and no other response strategy significantly increases the animal's net reward (Baum, 1981).
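
The arithmetic of the matching law can be sketched in a few lines of Python.  The function name and arguments are hypothetical, chosen only for this example; the reward rates are the rates actually obtained at each lever.

```python
def matching_time_fractions(reward_rate_1, reward_rate_2):
    """Fractions of time at each lever predicted by the matching law."""
    total = reward_rate_1 + reward_rate_2
    return reward_rate_1 / total, reward_rate_2 / total


# A lever rewarded three times as often as the other should get 3/4 of the time.
frac_1, frac_2 = matching_time_fractions(3.0, 1.0)
```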

Change of Schedule Rates in the Naive Rat
When an animal is highly familiar with a given ratio of reward rates on a pair of levers, it is resistant to learning a new set of schedules.  The figure on the right shows the behavior of 6 rats experiencing a change in the ratio of schedules on the two levers for the first time.  For example, Subject A needs to switch from matching a reward ratio of 1:1 to a new ratio of 1:4.  Though it eventually manages to do this, it takes some time before its behavior fully reflects the new rates.  In this figure, the rats had been exposed to a constant ratio of reward for 33 days before an uncued change in the reward schedules.  Gallistel et al. believe that, because the behavioral transition is slow under these conditions, the rat may be learning the new schedules through reinforcement learning.

These figures show the cumulative amount of time that the rat spent on lever 1 (the ordinate) vs. the amount of time spent on lever 2 (abscissa).  The straight lines show the predicted amount of time spent on each side according to the matching law.

 

Change of Schedule Rates in the Expert Rat
In this figure the rat has been required to adapt to a new set of schedules every day for several days.  As shown by this figure, rats can 'learn to learn' - they become much better at quickly adapting to a new ratio of rewards, sometimes accomplishing this after only a few rewards.  Gallistel et al. believe that the fast rate of learning shown by these rats cannot be explained by reinforcement learning - they argue that the rat doesn't have enough time to strengthen and weaken its behavior as a function of the rewards obtained at each lever and must instead be learning by monitoring its current income.


Simulations
To investigate whether or not this rapid rate of learning could be explained by reinforcement learning, a behaving rat operating under a concurrent variable interval schedule was simulated.  Each lever was modelled as an independent quasi-Poisson process, with exponentially distributed intervals between reward availabilities.  Since the model was created using discrete time, the exponential distribution was approximated as a geometric distribution:  each second, an unprimed lever was primed with a reward with a fixed probability.  When a lever was primed it remained in this state until the rat collected the reward.  The rat was also modelled as a quasi-Poisson process (in reality a Bernoulli process).  It pressed one of the levers each second and switched between them according to a geometric distribution.  Exact parameters are given in the Matlab code.
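
The setup described above can be sketched in Python as follows.  This is an illustrative reconstruction under stated assumptions, not the Matlab code used for the project.  In particular, the way the rat's switching is parameterized (a `switch_rate` scaled so that the long-run fraction of time on lever 1 is `frac_on_1`) is an assumption of this sketch, as are all names and parameter values.

```python
import random


def simulate_concurrent_vi(prime_probs, frac_on_1, switch_rate, n_steps, seed=0):
    """Simulate a rat on two concurrent variable interval levers.

    prime_probs : per-second priming probability of each lever
    frac_on_1   : intended long-run fraction of time on lever 1 (index 0)
    switch_rate : assumed overall scale of the switching probabilities
    Returns (seconds spent at each lever, rewards collected at each lever).
    """
    rng = random.Random(seed)
    primed = [False, False]
    time_at = [0, 0]
    rewards = [0, 0]
    # Geometric stay durations: leaving lever 1 with probability
    # switch_rate * (1 - frac_on_1) and lever 2 with switch_rate * frac_on_1
    # gives a stationary fraction of time on lever 1 equal to frac_on_1.
    leave_prob = (switch_rate * (1.0 - frac_on_1), switch_rate * frac_on_1)
    current = 0
    for _ in range(n_steps):
        # Both schedules keep running wherever the rat is.
        for i in (0, 1):
            if not primed[i] and rng.random() < prime_probs[i]:
                primed[i] = True
        # The rat presses its current lever once per second.
        time_at[current] += 1
        if primed[current]:
            primed[current] = False
            rewards[current] += 1
        # Bernoulli decision to switch levers.
        if rng.random() < leave_prob[current]:
            current = 1 - current
    return time_at, rewards
```

A call such as `simulate_concurrent_vi((0.05, 0.02), 0.7, 0.2, 50000)` returns the time and reward totals for one run; the priming probabilities here are arbitrary example values.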

Matching Behavior
The first test of the simulation was to see whether matching according to Herrnstein's matching law was indeed optimal.  The simulation was tested with a number of reward schedule ratios.  At each reward schedule ratio many different behavioral ratios were tested to find the optimal amount of time that the rat should spend at each lever.  When the levers are rewarded at nearly equal rates the simulation produces results consistent with expectations, but large divergences from a 1:1 ratio lead to pronounced overmatching.  This result holds over a large range of overall reward rates, and it is not yet clear why there is such a large discrepancy with theory.  The figures below show the results of the simulation.  Matlab code is at the end.
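
The search over behavioral ratios described above might look like the following Python sketch.  It uses a condensed version of the simulated rat, and the candidate fractions, `switch_rate`, run length and function names are all assumptions made for illustration.  No particular outcome is claimed here; a single run per candidate is noisy, so repeated runs with different seeds would be needed for a careful comparison.

```python
import random


def total_reward(prime_probs, frac_on_1, switch_rate, n_steps, seed):
    """Rewards collected by a rat aiming to spend frac_on_1 of its time on lever 1."""
    rng = random.Random(seed)
    primed = [False, False]
    leave_prob = (switch_rate * (1.0 - frac_on_1), switch_rate * frac_on_1)
    current, collected = 0, 0
    for _ in range(n_steps):
        for i in (0, 1):  # both schedules run every second
            if not primed[i] and rng.random() < prime_probs[i]:
                primed[i] = True
        if primed[current]:  # press the current lever
            primed[current] = False
            collected += 1
        if rng.random() < leave_prob[current]:
            current = 1 - current
    return collected


def best_time_fraction(prime_probs, candidates, switch_rate=0.2,
                       n_steps=50000, seed=0):
    """Return the candidate fraction with the highest total reward, plus all scores."""
    scores = {f: total_reward(prime_probs, f, switch_rate, n_steps, seed)
              for f in candidates}
    return max(scores, key=scores.get), scores
```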





Reinforcement Learning Theory
We then tried to get the rat to learn to optimize its behavior to different reward schedule ratios.  This was accomplished by creating a set of possible behavioral ratios with associated probability distributions.  These distributions would influence which ratio the rat chose to follow, and therefore how much time it spent at each lever.  Reinforcement learning was used to strengthen certain behavioral ratios and weaken others, eventually leading to optimal behavior.

In our simulation, the rat uses a stochastic policy to choose between possible behavioral ratios.  This means that it chooses among possible behavioral ratios with some probability p associated with each ratio, and updates its probability distribution as each reward comes in.  The probability distribution we chose to use in this simulation is the softmax distribution (Dayan and Abbott, 2001):

$$P[a] = \frac{\exp(\beta m_a)}{\sum_{a'} \exp(\beta m_{a'})}$$

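where a is the behavioral ratio, m is the parameter influenced by reinforcement learning, and beta determines the variability of the possible ratios.  For example, if we have two possible ratios, the probability that we choose ratio 1 or ratio 2 would be:

$$P[1] = \frac{\exp(\beta m_1)}{\exp(\beta m_1) + \exp(\beta m_2)}, \qquad P[2] = \frac{\exp(\beta m_2)}{\exp(\beta m_1) + \exp(\beta m_2)}$$
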
These probabilities are the standard sigmoid function written slightly differently:

$$P[1] = \frac{1}{1 + \exp(\beta (m_2 - m_1))}, \qquad P[2] = \frac{1}{1 + \exp(-\beta (m_2 - m_1))}$$

which is a function of m2-m1.  Therefore, the greater the difference in this underlying parameter, the more likely we are to choose one schedule over another.
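
The softmax choice rule and its two-ratio sigmoid form can be checked numerically with a short Python sketch.  The function names and the example values of m and beta are assumptions chosen for illustration.

```python
import math


def softmax_probs(m, beta):
    """Probability of choosing each behavioral ratio, given its parameter m."""
    peak = max(beta * v for v in m)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * v - peak) for v in m]
    total = sum(weights)
    return [w / total for w in weights]


def sigmoid_prob_first(m1, m2, beta):
    """Two-ratio case: probability of ratio 1 as a function of m2 - m1."""
    return 1.0 / (1.0 + math.exp(beta * (m2 - m1)))


# With two ratios, the softmax and sigmoid forms give the same probability.
probs = softmax_probs([0.2, 0.5], beta=3.0)
same = abs(probs[0] - sigmoid_prob_first(0.2, 0.5, beta=3.0)) < 1e-12
```

Larger values of beta make the choice more deterministic for the same difference in m; with beta equal to zero every ratio is equally likely.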

To update these probabilities, we use the Rescorla-Wagner rule, which develops weights that approximate the average reward:

$$m_a \to m_a + \epsilon \delta$$

Here, epsilon is the learning rate and delta is the difference between the value of the reward, r, and the m we're currently operating under, here shown for schedule 1:

$$\delta = r - m_1$$

A schedule parameter is only adjusted if it is actually under use at the time of the reward.  Under this set of learning rules, the behavior of the rat should eventually approximate the ratio of rewards that it is receiving at each lever.
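
A minimal Python sketch of this update is given below.  The function names are hypothetical, and the demonstration assumes, purely for illustration, that the update is applied every second with a reward of 1 or 0; the text above does not specify this detail, and the Matlab code may differ.

```python
import random


def rescorla_wagner_update(m, active, reward, epsilon):
    """Move m[active] toward the received reward; other entries are unchanged."""
    delta = reward - m[active]  # prediction error for the schedule in use
    m[active] += epsilon * delta
    return delta


def learn_average_reward(reward_prob, epsilon=0.01, n_steps=5000, seed=0):
    """Track the average reward of schedule 0 while schedule 1 is never used."""
    rng = random.Random(seed)
    m = [0.0, 0.0]
    for _ in range(n_steps):
        reward = 1.0 if rng.random() < reward_prob else 0.0
        rescorla_wagner_update(m, active=0, reward=reward, epsilon=epsilon)
    return m
```

Under these assumptions `m[0]` drifts toward the average reward of the schedule in use, while `m[1]` stays at its initial value because that schedule was never active.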


Reinforcement Learning in Simulation
The results of the simulations show that the rate of switching seems to depend on how many possible behavioral ratios the rat is using.  If it believes that there are many possible reward ratios it will learn relatively slowly to approximate the true ratio.  However, if the rat thinks that there are only two possible reward schedules, it quickly adapts to new schedules.  Below are shown the results of two such simulations, one using 2 possible behavioral ratios, one using 19.  The actual lever schedules changed from a 9:1 ratio to 1:9.  If the rat was using two schedules, they were ratios of 9:1 and 1:9.  If it was using 19 schedules, 9:1 and 1:9 were included in the possibilities.  In the figures, the blue line is the cumulative amount of time that the rat spent on each lever, the black lines are the rat's ideal performance under the matching law, and the red star is the point at which the lever schedules changed.  The Matlab code is at the end.



Below are shown the probabilities for each of the possible behavioral schedules as time progresses.  For the 19-schedule scenario, only the probabilities for 9:1 and 1:9 are drawn.  Schedule 1 for the left figure is 9:1, and schedule 2 is 1:9.  For the right figure, schedule 2 is 9:1 and schedule 18 is 1:9.  An obvious switch of ratio probabilities is evident at the time of the lever schedule change when two schedules are used.  The switch is much less obvious under the many-schedule scenario.


The m parameters also behave in a way consistent with the results of the simulation, and are shown below:



Conclusions
The results obtained demonstrate that it is indeed possible to use an algorithm based on the principles of reinforcement learning to generate the observed rapid schedule switching.  It is possible that at the beginning of training the rat is operating under the assumption that any ratio of rewards between the two levers is possible, and therefore takes time to settle on one that is close to optimal.  A highly trained rat, however, may assume that there are only a few possible schedules from which to pick, and therefore chooses rapidly.  The simulations presented here are consistent with this hypothesis. 

Disclaimer
Although the results obtained so far are consistent with the hypothesis, the full parameter space of these simulations was not tested due to time constraints.  It is possible that I am missing something.

Matlab Code
Optimal Matching
Switching with 2 ratios
Switching with 19 ratios

References
Baum, W. M. (1981).  Optimization and the matching law as accounts of instrumental behavior.  Journal of the Experimental Analysis of Behavior, 36, 387-403.
Dayan, P. and Abbott, L. F. (2001).  Theoretical neuroscience:  computational and mathematical modeling of neural systems, pp. 340-344.
Gallistel, C. R., Mark, T. A., King, A. P., and Latham, P. E. (2001).  The rat approximates an ideal detector of changes in rates of reward:  implications for the law of effect.  Journal of Experimental Psychology:  Animal Behavior Processes, 27(4), 354-372.
Herrnstein, R. J. (1961).  Relative and absolute strength of response as a function of frequency of reinforcement.  Journal of the Experimental Analysis of Behavior, 4, 267-272.