PhD Thesis
The following is a link to my PhD thesis, "Learning from Delayed Rewards" (Cambridge, 1989). Unfortunately, the original electronic version is long lost, so this version was scanned from a photocopy.
The thesis introduces the notion of reinforcement learning as learning to control a Markov decision process by incremental dynamic programming. It describes a range of algorithms for doing this, including Q-learning, for which it gives a sketch of a convergence proof.
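
For readers unfamiliar with Q-learning, here is a minimal sketch of the tabular, one-step form of the algorithm as it is commonly presented today. It is an illustration only, not code from the thesis. The five-state chain environment, the `step` and `q_learning` helpers, and the parameter values (`alpha`, `gamma`, `episodes`, `seed`) are all assumptions chosen to keep the example small. The agent explores uniformly at random, which is enough here because Q-learning estimates the value of the greedy policy regardless of how actions are chosen.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is the terminal goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics: reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular one-step Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # The terminal state has no future value; otherwise bootstrap
            # from the best action value currently estimated for next_state.
            target = reward if done else reward + gamma * max(q[next_state])
            # Move the estimate a fraction alpha towards the target.
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
# Greedy action in each non-terminal state (1 means "move right").
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

Each update nudges one table entry towards the immediate reward plus the discounted value of the best next action, which is the sense in which the method performs dynamic programming incrementally. In this toy chain the greedy policy should end up choosing "move right" in every non-terminal state.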