## Simulation 1: Methods and results

**a)** The task (2-armed bandit) is depicted as a binary choice task (blue or red squares), in which the model's decisions are represented as joystick movements. After each choice, the model either received a reward (sun) or did not (cross). **b)** Example of task design, showing the timeline of statistical environments (the order in which environments were presented was randomized across simulations). The plot shows the reward probability linked to each option (blue or red) as a function of trial number. In this example the model performed the task first in a stationary environment (Stat), then in a stationary environment with high uncertainty (Stat2), and finally in a volatile (Vol) environment. **c)** Learning rate (*λ*) time course (average across simulations ± s.e.m.). Because the order of statistical environments was randomized across simulations, each simulation's time course was re-sorted as Stat-Stat2-Vol before averaging. **d, e)** Average *λ* (across time and simulations, ± s.e.m.) as a function of environmental volatility in the RML (**d**) and in humans (**e**; modified from [30]). **f)** Human pupil size (a proxy of LC activity [34–36]) during the same task.
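
The task design in panel b can be illustrated with a minimal sketch of a reward-probability schedule. The specific probabilities, block lengths, and reversal count below are placeholder assumptions and are not taken from the source. The names `ENVIRONMENTS`, `build_schedule`, and `sample_outcome` are hypothetical, as is the choice to make the red option's reward probability the complement of the blue option's.

```python
import random

# Placeholder values (assumptions, not from the source):
# "p_blue" lists the reward probability of the blue option for each
# consecutive segment; "segment_len" is the number of trials per segment.
ENVIRONMENTS = {
    "Stat":  {"p_blue": [0.8], "segment_len": 60},                 # stationary
    "Stat2": {"p_blue": [0.6], "segment_len": 60},                 # stationary, high uncertainty
    "Vol":   {"p_blue": [0.8, 0.2, 0.8, 0.2], "segment_len": 15},  # volatile (reversals)
}


def build_schedule(order):
    """Return one (environment, p_blue) pair per trial for the given block order."""
    schedule = []
    for name in order:
        env = ENVIRONMENTS[name]
        for p in env["p_blue"]:
            schedule.extend([(name, p)] * env["segment_len"])
    return schedule


def sample_outcome(p_blue, choice, rng):
    """Return True (reward) or False (no reward) for a choice of 'blue' or 'red'.

    Assumes the red option's reward probability is 1 - p_blue.
    """
    p = p_blue if choice == "blue" else 1.0 - p_blue
    return rng.random() < p


# Randomize the block order for one simulation, as described in the caption.
rng = random.Random(0)
order = rng.sample(list(ENVIRONMENTS), k=len(ENVIRONMENTS))
schedule = build_schedule(order)
```

Re-sorting each simulation's trials by environment label (Stat, Stat2, Vol) before averaging would then correspond to the alignment described for panel c.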