When presented with two concurrent
variable interval reward schedules, rats visit the richer schedule more
frequently, and the ratio of the time spent on each schedule
approximates the ratio of the reward rates. Gallistel, Mark,
King, and Latham (2001) argue that well-trained rats learn to adjust to
new schedules more quickly than can be explained by reinforcement
learning. The aim of this project is to show that fast switching
can indeed be accounted for by reinforcement learning.
Reinforcement Learning
Reinforcement learning is the process of learning about stimuli on the
basis of their associated rewards and punishments. One form of
reinforcement learning is operant, or instrumental, conditioning.
In this form of reinforcement learning an animal is rewarded for
correctly performing an action. Through trial and error it
eventually learns to perform well on a given behavioral task by
strengthening some of its responses and weakening others.
The Variable Interval Schedule
Once an animal has learned a behavior well by being rewarded on every
correct action, it is possible to only reward the animal on some
fraction of the correct trials and still maintain the behavior.
An interval schedule makes it impossible for the animal to receive a
reward until a time delay has passed since the last reward. A
variable interval schedule draws these delays from an exponential
distribution (or a geometric distribution in discrete time) with a
parameter that sets the overall rate of reward. When the delay
has passed, the reward system is 'primed' and the next time the animal
performs the correct action it will receive a reward.
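The priming mechanism described above can be sketched in a few lines. This is a minimal Python illustration, not the original code; the class name `VariableIntervalLever` and the parameter `prime_prob` are hypothetical, and a per-step priming probability is assumed as the discrete-time stand-in for the exponential delay.

```python
import random


class VariableIntervalLever:
    """Discrete-time variable interval schedule with geometric priming delays."""

    def __init__(self, prime_prob, rng):
        self.prime_prob = prime_prob  # assumed per-step chance that an unprimed lever primes
        self.rng = rng
        self.primed = False

    def tick(self):
        # Once primed, the lever stays primed until the reward is collected.
        if not self.primed and self.rng.random() < self.prime_prob:
            self.primed = True

    def press(self):
        # A press pays out only if the lever is primed, and collecting resets it.
        if self.primed:
            self.primed = False
            return 1
        return 0


rng = random.Random(0)
lever = VariableIntervalLever(prime_prob=0.1, rng=rng)
rewards = 0
for _ in range(10_000):
    lever.tick()
    rewards += lever.press()  # pressing every step collects each reward as soon as it primes
print(rewards / 10_000)       # obtained reward rate per step
```

Because the animal here presses on every step, the obtained reward rate should come out close to `prime_prob`; an animal that presses less often would collect rewards later but would not lose them.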
Concurrent Schedules and Herrnstein's Matching Law
If an animal has learned to correctly perform one
variable interval schedule (for example, a rat that has learned how to
press a lever that only gives rewards at a certain rate), it may be
possible to train it on two simultaneously presented schedules; these
are known as concurrent schedules. In our case, a rat would be
presented with two levers that give rewards according to variable
interval schedules running at different rates. When animals are
actually subjected to this situation, the time that they spend at each
lever is in proportion to the rate at which they are actually rewarded
at that lever. For example, if one lever is rewarded three times
as frequently as the other, a rat would spend three times as much time
there (Herrnstein, 1961). This behavior has been shown to be
close to optimal, and no other response strategy significantly
increases the animal's net reward (Baum, 1981).
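The matching relation can be written compactly. The symbols below are introduced here for illustration and are not the source's notation: T_1 and T_2 are the times spent at each lever, and R_1 and R_2 are the reward rates obtained there.

```latex
\frac{T_1}{T_2} = \frac{R_1}{R_2}
\quad\Longleftrightarrow\quad
\frac{T_1}{T_1 + T_2} = \frac{R_1}{R_1 + R_2}
```

For the three-to-one example above, R_1 = 3 R_2 gives T_1 / (T_1 + T_2) = 3/4, so the rat would spend 75% of its time at the richer lever.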
Change of Schedule Rates in the Naive Rat
When an animal is highly familiar with a given
ratio of reward rates on a pair of levers, it is resistant to learning
a new set of schedules. The figure on the right shows the
behavior of 6 rats experiencing a change in ratio of schedules on the
two levers for the first time. For example, Subject A needs to
switch from matching a reward ratio of 1:1 to a new ratio of 1:4.
Though it eventually manages to do this, it takes some time before it
has fully realized that the levers are operating according to new
rates. In this figure, the rats had been exposed to a constant
ratio of reward for 33 days, and have now had an uncued change in the
reward schedules. Gallistel et al. believe that under these
conditions it is possible that the rat is learning the new schedules
through reinforcement learning because the behavioral transition is
slow.
These figures show the cumulative amount of time that the rat spent on
lever 1 (the ordinate) vs. the amount of time spent on lever 2
(abscissa). The straight lines show the predicted amount of time
spent on each side according to the matching law.
Change of Schedule Rates in the Expert Rat
In this figure the rat has been required to adapt to a new set of
schedules every day for several days. As shown by this figure,
rats can 'learn to learn' - they become much better at quickly adapting
to a new ratio of rewards, sometimes accomplishing this after only a
few rewards. Gallistel et al. believe that the fast rate of
learning shown by these rats cannot be explained by reinforcement
learning - they argue that the rat doesn't have enough time to
strengthen and weaken its behavior as a function of the rewards
obtained at each lever and must instead be learning by monitoring its
current income.
Simulations
To investigate whether or not this rapid rate of learning
could be explained by reinforcement learning, a behaving rat operating
under a concurrent variable interval schedule was simulated. Each
lever was modelled as an independent quasi-Poisson process, with an
exponential distribution between reward availabilities. Since the
model was created using discrete time, the exponential distribution was
approximated as a geometric distribution. Each second, each unprimed lever
was primed with a fixed probability set by its reward rate.
When a lever was primed it remained in this state until the rat
collected the reward. The rat was also modelled as a
quasi-Poisson process (in reality a Bernoulli process). It
pressed one of the levers each second and switched between them
according to a geometric distribution. Exact parameters are
given in the MATLAB code.
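The setup described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the original MATLAB code: the function name `simulate`, the parameter names, and all numerical values are hypothetical. The rat's geometric switching is modelled as a fixed per-second probability of leaving the current lever.

```python
import random


def simulate(prime_probs, switch_probs, n_steps, seed=0):
    """Simulate a rat on two concurrent discrete-time VI levers.

    prime_probs[i]:  assumed per-second chance that lever i becomes primed.
    switch_probs[i]: assumed per-second chance that the rat leaves lever i,
                     so the stay duration on lever i is geometric.
    Returns (seconds spent on each lever, rewards collected at each lever).
    """
    rng = random.Random(seed)
    primed = [False, False]
    time_on = [0, 0]
    rewards = [0, 0]
    side = 0
    for _ in range(n_steps):
        # Both levers keep running whether or not the rat is present.
        for i in (0, 1):
            if not primed[i] and rng.random() < prime_probs[i]:
                primed[i] = True
        # The rat presses the lever it is currently at.
        time_on[side] += 1
        if primed[side]:
            primed[side] = False
            rewards[side] += 1
        # Bernoulli switching gives geometric stay durations.
        if rng.random() < switch_probs[side]:
            side = 1 - side
    return time_on, rewards


# Hypothetical values: a 9:1 schedule ratio and switching that yields about 9:1 time.
time_on, rewards = simulate(prime_probs=(0.09, 0.01),
                            switch_probs=(0.05, 0.45),
                            n_steps=20_000)
print(time_on, rewards)
```

With leave probabilities q0 and q1, the long-run fraction of time on lever 0 is q1 / (q0 + q1), which is how the switching parameters set the behavioral ratio in this sketch.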
Matching Behavior
The first test of the simulation was to see whether
matching according to Herrnstein's matching law was indeed
optimal. The simulation was tested with a number of reward
schedule ratios. At each reward schedule ratio many different
behavioral ratios were tested to find the optimal amount of time that
the rat should spend at each lever. When the levers are rewarded
at nearly equal rates the simulation produces results consistent with
expectations, but large divergences from a 1:1 ratio lead to pronounced
overmatching. This result holds over a large range of overall
reward rates, and it is not yet clear why there is such a large
discrepancy with theory. The figures below show the results of
the simulation. MATLAB code is at the end.
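The search over behavioral ratios can be sketched along these lines. This is a hedged Python illustration rather than the original MATLAB code: `total_reward`, `switch_scale`, the grid of time fractions, and the priming probabilities are all assumptions made for the example, and the sketch makes no claim about which ratio turns out best.

```python
import random


def total_reward(frac_on_0, prime_probs, n_steps=20_000, switch_scale=0.5, seed=0):
    """Total reward when the rat spends roughly frac_on_0 of its time on lever 0."""
    rng = random.Random(seed)
    # Leave probabilities chosen so the stationary time fraction on lever 0 is frac_on_0.
    leave = (switch_scale * (1.0 - frac_on_0), switch_scale * frac_on_0)
    primed = [False, False]
    side, total = 0, 0
    for _ in range(n_steps):
        # Each unprimed lever primes with its own per-second probability.
        for i in (0, 1):
            if not primed[i] and rng.random() < prime_probs[i]:
                primed[i] = True
        # Pressing the current lever collects a waiting reward, if any.
        if primed[side]:
            primed[side] = False
            total += 1
        if rng.random() < leave[side]:
            side = 1 - side
    return total


fractions = [k / 20 for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
prime_probs = (0.09, 0.01)                  # hypothetical 9:1 reward schedule ratio
totals = [total_reward(f, prime_probs) for f in fractions]
best = fractions[totals.index(max(totals))]
print("best time fraction on lever 0:", best)
```

Repeating the sweep for several schedule ratios and plotting `best` against the reward ratio would give a curve to compare with the matching prediction. A single seed per point is noisy, so averaging over several seeds would be a sensible refinement.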
Reinforcement Learning Theory
We then tried to get the rat to learn to optimize its
behavior to different reward schedule ratios. This was
accomplished by creating a set of possible behavioral ratios, each with
an associated probability. These probabilities
influence which ratio the rat chose to follow, and therefore how much
time it spent at each lever. Reinforcement learning was used to
strengthen certain behavioral ratios and weaken others, eventually
leading to optimal behavior.
In our simulation, the rat uses a stochastic policy to choose between
possible behavioral ratios. This means that it chooses among
possible behavioral ratios with some probability p associated with each
ratio, and updates its probability distribution as each reward comes
in. The probability distribution we chose to use in this
simulation is the softmax distribution (Dayan and Abbott, 2001):
$$P[a] = \frac{\exp(\beta m_a)}{\sum_{a'} \exp(\beta m_{a'})}$$
where a is the behavioral ratio, m is the parameter influenced by
reinforcement learning, and beta determines the variability of the
possible ratios. For example, if we have two possible ratios, the
probability that we choose ratio 1 or ratio 2 would be:
$$P[1] = \frac{\exp(\beta m_1)}{\exp(\beta m_1) + \exp(\beta m_2)}, \qquad P[2] = \frac{\exp(\beta m_2)}{\exp(\beta m_1) + \exp(\beta m_2)}$$
These probabilities are the standard sigmoid function written slightly
differently:
$$P[1] = \frac{1}{1 + \exp(\beta (m_2 - m_1))}$$
which is a function of m2-m1. Therefore, the greater the
difference in this underlying parameter, the more likely we are to
choose one schedule over another.
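The equivalence between the two-ratio softmax and the sigmoid is easy to check numerically. The sketch below is a minimal Python illustration; the function names and the example values of m and beta are assumptions, and the softmax is taken in its usual form, where each ratio is weighted by exp(beta * m).

```python
import math


def softmax(m, beta):
    """Softmax probabilities over the parameters m, with sharpness beta."""
    top = max(m)  # subtracting the max avoids overflow and cancels in the ratio
    weights = [math.exp(beta * (v - top)) for v in m]
    total = sum(weights)
    return [w / total for w in weights]


def sigmoid_p1(m1, m2, beta):
    """Two-ratio case: probability of ratio 1 as a sigmoid of m2 - m1."""
    return 1.0 / (1.0 + math.exp(beta * (m2 - m1)))


p = softmax([0.8, 0.2], beta=2.0)
print(p[0], sigmoid_p1(0.8, 0.2, beta=2.0))  # the two values agree
```

A larger beta makes the choice more deterministic for the same difference in m, while beta near zero makes all ratios nearly equally likely.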
To update these probabilities, we use the Rescorla-Wagner rule, which
develops weights that approximate the average reward:
$$m_a \rightarrow m_a + \epsilon \delta$$
Here, epsilon is the learning rate and delta is the difference between
the value of the reward, r, and the m we're currently operating under, here
shown for schedule 1:
$$\delta = r - m_1$$
A schedule parameter is only adjusted if
it is actually under use at the time of the reward. Under this
set of learning rules, the behavior of the rat should eventually
approximate the ratio of rewards that it is receiving at each lever.
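A single Rescorla-Wagner step, applied only to the ratio currently in use, can be sketched as below. This is a minimal Python illustration; the function name, the argument names, and the example values are hypothetical and are not taken from the original MATLAB code.

```python
def rescorla_wagner_update(m, active, reward, epsilon):
    """Return a copy of m with only the active entry moved toward the reward."""
    delta = reward - m[active]          # prediction error for the ratio in use
    updated = list(m)
    updated[active] += epsilon * delta  # all other entries are left unchanged
    return updated


m = [0.0, 0.0]
for _ in range(50):
    m = rescorla_wagner_update(m, active=0, reward=1.0, epsilon=0.1)
print(m)  # m[0] has moved most of the way toward the reward value; m[1] is untouched
```

With a constant reward, each step closes a fraction epsilon of the remaining gap, so after n steps the active parameter has covered 1 - (1 - epsilon)^n of the distance to the reward value. This is the sense in which the weights come to approximate the average reward.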
Reinforcement Learning in Simulation
The results of the simulations show that the rate of switching seems to
depend on how many possible behavioral ratios the rat is using.
If it believes that there are many possible reward ratios it will learn
relatively slowly to approximate the true ratio. However, if the
rat thinks that there are only two possible reward schedules, it
quickly adapts to new schedules. Below are shown the results of
two such simulations, one using 2 possible behavioral ratios, one using
19. The actual lever schedules changed from a 9:1 ratio to
1:9. If the rat was using two schedules, they were ratios of 9:1
and 1:9. If it was using 19 schedules, 9:1 and 1:9 were included
among the possibilities. In the figures, the blue line is the
cumulative amount of time that the rat spent on each lever, the black
lines are the rat's ideal performance under the matching law, and the
red star is the point at which the lever schedules changed. The
MATLAB code is at the end.

Below are shown the probabilities for each of the possible behavioral
schedules as time progresses. For the 19-schedule scenario, only
the probabilities for 9:1 and 1:9 are drawn. Schedule 1 for the
left figure is 9:1, and schedule 2 is 1:9. For the right figure,
schedule 2 is 9:1 and schedule 18 is 1:9. An obvious switch of
ratio probabilities is evident at the time of the lever schedule change
when two schedules are used. The switch is much less obvious
under the many-schedule scenario.
The m parameters also behave in a way consistent with the results of
the simulation, and are shown below:
Conclusions
The results obtained demonstrate that it is indeed possible to use an
algorithm based on the principles of reinforcement learning to generate
the observed rapid schedule switching. It is possible that at the
beginning of training the rat is operating under the assumption that
any ratio of rewards between the two levers is possible, and therefore
takes time to settle on one that is close to optimal. A highly
trained rat, however, may assume that there are only a few possible
schedules from which to pick, and therefore chooses rapidly. The
simulations presented here are consistent with this hypothesis.
Disclaimer
Although the results obtained so far are consistent with the
hypothesis, the full parameter space of these simulations was not
tested due to time constraints. It is possible that I am missing
something.
MATLAB Code
Optimal Matching
Switching with 2 ratios
Switching with 19 ratios
References
Baum, W. M. (1981). Optimization and the matching law as accounts of
instrumental behavior. Journal of the Experimental Analysis of
Behavior, 36, 387-403.
Dayan, P., and Abbott, L. F. (2001). Theoretical neuroscience:
computational and mathematical modeling of neural systems.
Cambridge, MA: MIT Press, 340-344.
Gallistel, C. R., Mark, T. A., King, A. P., and Latham, P. E. (2001).
The rat approximates an ideal detector of changes in rates of reward:
implications for the law of effect. Journal of Experimental Psychology:
Animal Behavior Processes, 27(4), 354-372.
Herrnstein, R. J. (1961). Relative and absolute strength of response as
a function of frequency of reinforcement. Journal of the Experimental
Analysis of Behavior, 4, 267-272.