# Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

Brian Sallans

Department of Computer Science

University of Toronto

Toronto M5S 2Z9 Canada

Geoffrey Hinton

Gatsby Computational Neuroscience Unit

University College London

17 Queen Square, London WC1N 3AR, UK

**Abstract**

The problem of reinforcement learning in large factored Markov
decision processes is explored. The Q-value of a state-action pair is approximated
by the free energy of a product of experts network. Network parameters are learned
on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values
of consecutive state-action pairs. Actions are chosen based on the current value
estimates by fixing the current state and sampling actions from the network using Gibbs
sampling. The algorithm is tested on a co-operative multi-agent task. The
product of experts model is found to perform comparably to table-based Q-learning for
small instances of the task, and continues to perform well when the problem becomes too
large for a table-based representation.
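
The three steps named in the abstract (a free-energy Q-value, a SARSA-style update, and Gibbs sampling of actions with the state clamped) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the product of experts takes the form of a restricted Boltzmann machine over binary state and action vectors, assumes the sign convention Q(s, a) = -F(s, a), and samples at a fixed temperature of 1. The class `FreeEnergyQ`, its method names, and the default values of `lr`, `gamma`, and `n_sweeps` are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class FreeEnergyQ:
    """Hypothetical RBM-style product of experts over binary (state, action) vectors."""

    def __init__(self, n_state, n_action, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_state = n_state
        n_visible = n_state + n_action
        self.W = 0.1 * self.rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible (state and action) biases
        self.c = np.zeros(n_hidden)   # hidden (expert) biases

    def q_value(self, s, a):
        # Assumed convention: Q(s, a) = -F(s, a), where
        # -F(v) = b.v + sum_j log(1 + exp(c_j + (v W)_j)).
        v = np.concatenate([s, a])
        return v @ self.b + np.sum(np.logaddexp(0.0, self.c + v @ self.W))

    def sarsa_update(self, s, a, r, s_next, a_next, lr=0.01, gamma=0.9):
        # Move Q(s, a) toward r + gamma * Q(s', a') along the gradient of Q(s, a).
        v = np.concatenate([s, a])
        td_error = r + gamma * self.q_value(s_next, a_next) - self.q_value(s, a)
        h = sigmoid(self.c + v @ self.W)  # expected hidden activities
        self.W += lr * td_error * np.outer(v, h)  # dQ/dW = v h^T
        self.b += lr * td_error * v               # dQ/db = v
        self.c += lr * td_error * h               # dQ/dc = h
        return td_error

    def sample_action(self, s, n_sweeps=20):
        # Gibbs sampling with the state clamped: alternate between
        # sampling hidden units and resampling only the action units.
        n_action = self.b.size - self.n_state
        W_a = self.W[self.n_state:]
        b_a = self.b[self.n_state:]
        a = self.rng.integers(0, 2, size=n_action).astype(float)
        for _ in range(n_sweeps):
            v = np.concatenate([s, a])
            p_h = sigmoid(self.c + v @ self.W)
            h = (self.rng.random(p_h.size) < p_h).astype(float)
            p_a = sigmoid(b_a + W_a @ h)
            a = (self.rng.random(n_action) < p_a).astype(float)
        return a


model = FreeEnergyQ(n_state=3, n_action=2, n_hidden=4, seed=1)
s, a = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0])
s_next = np.array([0.0, 1.0, 0.0])
a_next = model.sample_action(s_next)
td = model.sarsa_update(s, a, r=1.0, s_next=s_next, a_next=a_next)
```

Under these assumptions the Gibbs chain samples actions with probability proportional to exp(Q(s, a)), so higher-valued actions are drawn more often, and the cost of evaluating Q grows with the number of units rather than with the number of joint state-action combinations.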

*Submitted to Advances in Neural Information Processing Systems
13, MIT Press, Cambridge, MA*
