Integrated learning and planning based on truncating temporal differences
Abstract: Reinforcement learning systems learn to act in an uncertain environment by executing actions and observing their long-term effects. A large number of time steps may be required before this trial-and-error process converges to a satisfactory policy. It is highly desirable to minimize the number of experiences the system needs to learn to perform its task, particularly if making errors is costly. One approach to achieving this goal is to use hypothetical experiences, which requires some additional computation but may reduce the necessary number of much more costly real experiences. This well-known idea of augmenting reinforcement learning by planning is revisited in this paper in the context of truncated TD(λ), or TTD, a simple computational technique which allows reinforcement learning algorithms based on the methods of temporal differences to learn considerably faster with essentially no additional computational expense. Two different ways of combining TTD with planning are proposed, which make it possible to benefit from λ > 0 in both the learning and planning processes. The algorithms are evaluated experimentally on a family of grid path-finding tasks and shown to indeed yield a considerable reduction in the number of real interactions with the environment necessary to converge, as well as an improvement in scaling properties.
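The core of TTD is the truncated λ-return: instead of waiting for a full episode, the update target for a state visited m steps ago is computed from a buffer of the last m transitions, blending n-step returns with weight λ. A minimal sketch of that backward recursion (function and variable names are illustrative, not from the paper; `rewards[k]` and `next_values[k]` are assumed to hold the reward and current value estimate of the successor state for the k-th buffered transition):

```python
def truncated_lambda_return(rewards, next_values, gamma, lam):
    """Truncated lambda-return over an m-step buffer (illustrative sketch).

    rewards[k]     -- reward received on the k-th buffered transition
    next_values[k] -- current estimate V(s_{k+1}) for that transition
    gamma, lam     -- discount factor and trace-decay parameter
    """
    g = next_values[-1]  # bootstrap from the state at the truncation horizon
    # Walk the buffer backward, mixing one-step bootstrapping (weight 1-lam)
    # with the longer return accumulated so far (weight lam).
    for r, v in zip(reversed(rewards), reversed(next_values)):
        g = r + gamma * ((1.0 - lam) * v + lam * g)
    return g
```

With λ = 0 this reduces to the one-step TD(0) target; with λ = 1 it is the m-step truncated Monte Carlo return. Because the recursion only touches the fixed-length buffer, the per-step cost is independent of episode length, which is the "essentially no additional computational expense" property the abstract refers to.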
| Field | Value |
|---|---|
| Publication size in sheets | 0.75 |
| Book | Someren Maarten van, Widmer Gerhard (eds.): Machine Learning: ECML-97, Lecture Notes in Computer Science, no. 1224, 1997, Springer Berlin Heidelberg, ISBN 978-3-540-62858-3 [978-3-540-68708-5] |
| Keywords in English | Algorithm Analysis and Problem Complexity, Artificial Intelligence (incl. Robotics) |
| Publication indicators | = 3.0 |
| Citation count* | 3 (2020-08-28) |
* The presented citation count is obtained through Internet information analysis and is close to the number calculated by the Publish or Perish system.