Turing Award 2025

Congratulations to Andy Barto and Rich Sutton on winning the 2025 Turing Award.
They started the field, produced fundamental results, and it is thanks to their long, careful leadership and boundless enthusiasm that RL has achieved the success it has today.



Reinforcement Learning: some history

[Pictures: left, the learned value function; right, visit counts of Q learning]

I was a thoroughly unsuccessful graduate student at King's College, Cambridge, from 1982 to 1985. I tried to formalise Piaget's theories of the genesis of intelligence in children, without success. So I left and got a job with the AI group at Philips Research Labs in Surrey, England, where I worked on expert systems and decision trees. I kept my interest in fundamental questions about learning, and I still hoped to finish my PhD.

In 1987, Philips generously sent me to the 4th International Workshop on Machine Learning at UC Irvine, CA, although I had no paper to present. It was my first international conference and I vividly remember some of the talks. I was bored during one talk, and perhaps rudely I interrupted the speaker to ask a totally irrelevant question: "Have you done any work on animal learning?" Without missing a beat he said "Oh, that's all been done" and carried on with his presentation. My mind wandered for the rest of the session. I realised that animal learning had not 'all been done', and I realised to my surprise that I had not seen any ML research at all related to animal learning. Studying children had been too hard for me; perhaps with animal learning I could make more progress?

In the coffee break, Rich Sutton came over to me and introduced himself. He said he liked my question, and kindly gave me reprints of some of his papers, including his now-famous 1983 paper with Andy Barto and Chuck Anderson on learning pole-balancing.

On the long flight home from California I read and re-read this 1983 paper. I was entranced. Highly original, unlike any other work I had seen. It described a simple algorithm which seemed curiously unrelated to other theory, like a new small island that had risen by itself out of the ocean.

Surely, I felt, there must be some theoretical framework that explained why this algorithm worked? I went to the Philips company library and looked for it. Luckily the library was small and a little out of date, and I found the right book, "Applied Dynamic Programming" by Bellman and Dreyfus, published 25 years before in 1962. From this book, I learned about Markov Decision Processes (MDPs) and how to calculate optimal policies using dynamic programming.

This was the framework I was looking for! But the dynamic programming algorithms from 1962 did not look like 'learning'. Bellman and Dreyfus assumed that the entire MDP was known, and that all its parameters could be used for exact calculation.

But what if a small animal or agent were living inside an MDP? What if it didn't know the MDP's structure? The agent would experience and explore the MDP one state at a time. At each time-step, it would observe the state, choose an action, receive a reward, then find itself in the next state. Could such an agent learn an optimal policy? How much would it need to remember?

I realised that a natural way to think of reinforcement learning was as incremental dynamic programming by an agent experiencing and exploring an MDP. The actor-critic architecture of Barto, Sutton, and Anderson fitted nicely into this framework, but there was a family of other possible algorithms as well, estimating expected future returns using the elegant method of Temporal Differences that Sutton published in 1988. Finally I alighted on the simplest method of all: one-step Q learning. I sketched a proof of its convergence.
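
For readers who have not met it, the one-step Q learning update is easy to sketch in a few lines of Python. This is only an illustration of the idea, not code from the thesis; the names, learning rate, and discount factor below are arbitrary.

    # One-step Q learning update for an agent living inside an unknown MDP.
    # Q is the agent's table of estimated action values, here a dict mapping each
    # state to a dict of action values. alpha is a learning rate and gamma a
    # discount factor; both values here are illustrative.
    def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
        # Target: the immediate reward plus the discounted value of the best
        # action currently known in the next state.
        best_next = max(Q[next_state].values())
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

All the agent needs to remember is the table Q: after each step of experience it nudges one entry towards this one-step lookahead target, and under suitable conditions the table converges to the optimal action values.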

By sheer good luck, in spring 1989 Andy Barto was on sabbatical in Cambridge visiting my college, King's. I went to see him there, with an early draft of my work. He made helpful comments, and we passed a draft to Rich Sutton, who made helpful comments too. It was decided that Andy would be the external examiner for my viva (that is, the external on my committee). Soon he would go back to the US, so I had a deadline. I finally submitted my thesis, "Learning from Delayed Rewards", in May 1989.

Andy Barto, Rich Sutton and I then published two papers explaining the relationship of their actor-critic methods to the MDP dynamic programming framework:

Learning and Sequential Decision Making
A.G. Barto, R.S. Sutton and C. Watkins
COINS Technical Report 89-95, September 1989

Sequential decision problems and neural networks
A.G. Barto, R.S. Sutton and C. Watkins
Advances in Neural Information Processing Systems, December 1989

Rich and Andy must have been surprised to learn of my work, but they were most friendly, generous, and welcoming, and we have been on good terms ever since. Indeed, if I remember correctly from those pre-PDF days, for some time they were kind enough to mail actual paper photocopies of my unpublished thesis to researchers in the US who asked for it.

At the end of 1989 I left Philips Labs, and for a few years I worked for hedge funds in London, so I stopped doing active research in reinforcement learning, apart from working with Peter Dayan to publish

Technical note: Q Learning
Christopher J.C.H. Watkins and Peter Dayan
Machine Learning 8, 1992

Peter, who was then a graduate student at Edinburgh, had contacted me to point out that the convergence proof I had sketched did not show convergence with probability 1. This paper corrected that, presenting the algorithm together with a proper convergence proof in journal form.

The field of reinforcement learning took some time to produce significant results, but now the achievements and speed of progress are astonishing. I am proud to have played my part in it.

Explanation of the pictures

The pictures at the top are ones I use in teaching. They show half-completed Q learning on a 500x500 gridworld. This is a very simple and standard RL problem, but the behaviour of Q learning can be surprisingly complex.

The dark bar is a 'puddle' where all rewards are negative, and the small bright rectangle below it is a 'target'. When the agent reaches the target, it is placed back at a random position in the grid. The agent performs epsilon-greedy action selection, with epsilon=0.1. Initial Q values were set below the true optimal values.
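
For anyone who wants to recreate something like these pictures, here is a rough sketch of the kind of program involved, in Python. The grid size, rewards, and the exact geometry of the puddle and target below are illustrative guesses, not the settings used for the figures.

    import numpy as np

    # Tabular Q learning on a puddle-and-target gridworld (illustrative settings).
    N = 100                                        # the figures use a 500x500 grid
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    Q = np.full((N, N, len(ACTIONS)), -150.0)      # initial Q values below optimal
    alpha, gamma, epsilon = 0.1, 0.99, 0.1

    puddle = (slice(20, 25), slice(10, 90))        # dark bar of negative reward
    target = (slice(30, 34), slice(45, 55))        # small bright rectangle below it

    def in_region(pos, region):
        return (region[0].start <= pos[0] < region[0].stop
                and region[1].start <= pos[1] < region[1].stop)

    rng = np.random.default_rng(0)
    pos = tuple(rng.integers(0, N, size=2))
    visits = np.zeros((N, N))

    for step in range(1_000_000):
        visits[pos] += 1
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(len(ACTIONS)))
        else:
            a = int(np.argmax(Q[pos]))
        dr, dc = ACTIONS[a]
        nxt = (min(max(pos[0] + dr, 0), N - 1), min(max(pos[1] + dc, 0), N - 1))
        reward = -10.0 if in_region(nxt, puddle) else -1.0
        if in_region(nxt, target):
            # Reaching the target ends the episode: update without bootstrapping,
            # then place the agent back at a random position in the grid.
            Q[pos][a] += alpha * (reward - Q[pos][a])
            pos = tuple(rng.integers(0, N, size=2))
        else:
            Q[pos][a] += alpha * (reward + gamma * Q[nxt].max() - Q[pos][a])
            pos = nxt

    # Left picture: a heatmap of Q.max(axis=2). Right picture: a heatmap of visits.

Changing the initial fill value from -150 to something above the true optimal values (0 would do here, since every reward in this sketch is negative) gives the pathologically exploratory behaviour discussed below.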

The left plot is a heatmap of the learned value function. Note the slow back-chaining of the value function out from the target, and the striking fingering instability caused by the random exploration of the agent.

The right plot shows the average number of visits to each cell in the grid during a run at this stage of learning. The visit count is moderately high in the areas where the agent still has not found an effective policy, and the suboptimal 'path habits' within the central area are clearly visible.

Fingering instabilities and path habits are common in RL. Why not just set the initial Q values higher than optimal? Then the agent becomes pathologically exploratory, which works just fine in this tiny grid but is undesirable in a larger state space.

The moral is that sophisticated control of exploration is essential for more complex tasks. In a sense, reinforcement learning only provides half an algorithm: the other half is the control of exploration.
