Integrated learning and planning based on truncating temporal differences

Paweł Cichosz


Reinforcement learning systems learn to act in an uncertain environment by executing actions and observing their long-term effects. A large number of time steps may be required before this trial-and-error process converges to a satisfactory policy. It is highly desirable that the number of experiences needed by the system to learn to perform its task be minimized, particularly if making errors is costly. One approach to achieving this goal is to use hypothetical experiences, which requires some additional computation but may reduce the necessary number of much more costly real experiences. This well-known idea of augmenting reinforcement learning with planning is revisited in this paper in the context of truncated TD(λ), or TTD, a simple computational technique which allows reinforcement learning algorithms based on the methods of temporal differences to learn considerably faster with essentially no additional computational expense. Two different ways of combining TTD with planning are proposed which make it possible to benefit from λ > 0 in both the learning and planning processes. The algorithms are evaluated experimentally on a family of grid path-finding tasks and shown to indeed yield a considerable reduction of the number of real interactions with the environment necessary to converge, as well as an improvement of scaling properties.
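The core of the TTD technique mentioned in the abstract is that the λ-return for a state is truncated after a fixed number of steps, so it can be computed from a short buffer of recent experiences by a single backward recursion. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name, the list-based buffer, and the bootstrapping on the stored value estimates are assumptions made here for clarity.

```python
def ttd_return(rewards, next_values, gamma, lam):
    """Truncated lambda-return for the oldest experience in a buffer.

    rewards[k]     -- reward received at buffer step k (assumed layout)
    next_values[k] -- current value estimate V(s_{k+1}) at buffer step k
    gamma, lam     -- discount factor and the TD(lambda) trace parameter
    """
    # Beyond the end of the buffer the return is truncated: bootstrap on V.
    z = rewards[-1] + gamma * next_values[-1]
    # Walk backwards through the buffer, mixing the one-step target
    # (1 - lam) * V(s_{k+1}) with the deeper truncated return z.
    for k in range(len(rewards) - 2, -1, -1):
        z = rewards[k] + gamma * ((1 - lam) * next_values[k] + lam * z)
    return z
```

With λ = 0 this reduces to the ordinary one-step TD target for the oldest state, and with λ = 1 to the m-step truncated return, which is why intermediate values of λ interpolate between the two.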
Author: Paweł Cichosz (FEIT / PE), The Institute of Electronic Systems
Publication size in sheets: 0.75
Book: Someren Maarten van, Widmer Gerhard (eds.): Machine Learning: ECML-97, Lecture Notes in Computer Science, no. 1224, Springer Berlin Heidelberg, 1997, ISBN 978-3-540-62858-3 [978-3-540-68708-5]
Keywords in English: Algorithm Analysis and Problem Complexity, Artificial Intelligence (incl. Robotics)
Language: English (en)
Score (nominal): 3
Publication indicators: GS Citations = 3.0
Citation count*: 3 (2020-08-28)

* The presented citation count is obtained through Internet information analysis and is close to the number calculated by the Publish or Perish system.