Public Library of Science
pcbi.1008317.g002.tif (1.97 MB)

Transferring state abstractions between MDPs.

posted on 2020-10-15, 17:54 authored by Lucas Lehnert, Michael L. Littman, Michael J. Frank

(A) In both grid-world tasks, the agent can move up (↑), down (↓), left (←), or right (→) and is rewarded when it enters a reward column. The black line indicates a barrier the agent cannot pass. Task A and Task B differ in both rewards and transitions, because a different column is rewarded and the barrier is placed at a different location. (B) A reward-predictive state representation generalizes across different columns; the corresponding successor features (SFs) are plotted below in (D). (D) Each row in the matrix plots visualizes the entries of a three-dimensional SF vector. Similar to the example in Fig 1, a reward-predictive state abstraction merges each column into one latent state, as indicated by the colouring. In both tasks, reward sequences can be predicted from the compressed representation for any start state and action sequence, similar to Fig 1B. In this case, the agent only needs to learn a different policy for Task B using the same compressed representation. In contrast, the matrix plots in the bottom panels illustrate that SFs differ between the two tasks and cannot be immediately reused in this example (because SFs are computed for the optimal policy, which is different in each task [14]). Note that states that belong to the same column have equal SF weights (as indicated by the coloured boxes). Linear Successor Feature Models (LSFMs) construct a reward-predictive state representation by merging states with equal SFs into the same state partition. This algorithm is described in supporting S3 Text and prior work [9]. (C) One possible reward-maximizing state abstraction may generalize across all states. While it is possible to learn or compute the optimal policy using this state abstraction in Task A (i.e., always go right), this state abstraction cannot be used to learn the optimal policy in Task B, in which the column position is needed to know whether to go left or right.
This example illustrates that reward-predictive state representations are suitable for re-use across tasks that vary in rewards and transitions. While reward-maximizing state abstractions may compress a task further than reward-predictive state abstractions, they may also simplify a task to an extent that makes them usable only in the single task for which they were constructed.
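
The reward-prediction property described in this caption can be checked by brute force on a small example. The sketch below is an illustration under stated assumptions, not the authors' LSFM algorithm or their exact tasks: the 3x3 grid size, the full-height barrier placement, the rule that a reward of 1 is given only when a move crosses into the reward column, and the names `make_task`, `reward_sequence`, and `is_reward_predictive` are all hypothetical. It enumerates every action sequence up to a fixed horizon and tests whether all states mapped to the same latent state produce the same reward sequence.

```python
from itertools import product

N = 3  # assumed 3x3 grid; a state is a (row, column) pair
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_task(reward_col, barrier_after_col=None):
    """Build a deterministic step function for a hypothetical grid-world task.

    barrier_after_col=c places a full-height wall between columns c and c+1.
    """
    def step(state, action):
        r, c = state
        dr, dc = ACTIONS[action]
        # Moves that would leave the grid keep the agent in place.
        nr = min(max(r + dr, 0), N - 1)
        nc = min(max(c + dc, 0), N - 1)
        # Moves that would cross the barrier are blocked.
        if barrier_after_col is not None and {c, nc} == {
            barrier_after_col,
            barrier_after_col + 1,
        }:
            nc = c
        # Assumption: reward only when the move enters the reward column.
        reward = 1.0 if (nc == reward_col and nc != c) else 0.0
        return (nr, nc), reward

    return step


def reward_sequence(step, state, actions):
    """Roll out an action sequence and collect the rewards it produces."""
    rewards = []
    for action in actions:
        state, reward = step(state, action)
        rewards.append(reward)
    return tuple(rewards)


def is_reward_predictive(step, abstraction, horizon=4):
    """True if states sharing a latent state always share reward sequences."""
    states = [(r, c) for r in range(N) for c in range(N)]
    for actions in product(ACTIONS, repeat=horizon):
        seen = {}  # latent state -> reward sequence observed for it
        for s in states:
            rewards = reward_sequence(step, s, actions)
            if seen.setdefault(abstraction(s), rewards) != rewards:
                return False
    return True


def column_abstraction(state):
    """Merge each column into one latent state."""
    return state[1]


def single_state_abstraction(state):
    """Merge all states into one latent state."""
    return 0


task_a = make_task(reward_col=2)                       # no barrier
task_b = make_task(reward_col=1, barrier_after_col=1)  # wall between columns 1 and 2

results = {
    "column": (is_reward_predictive(task_a, column_abstraction),
               is_reward_predictive(task_b, column_abstraction)),
    "single": (is_reward_predictive(task_a, single_state_abstraction),
               is_reward_predictive(task_b, single_state_abstraction)),
}
print(results)
```

In this toy setting the column abstraction passes the check in both tasks, while the abstraction that merges all states fails it, because the reward produced by an action depends on the column the agent is in. The check is exhaustive only up to the chosen horizon and assumes deterministic transitions.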