Journal of Neuroscience
Articles, Systems/Circuits

Goal-Directed Decision Making with Spiking Neurons

Johannes Friedrich and Máté Lengyel
Journal of Neuroscience 3 February 2016, 36 (5) 1529-1546; https://doi.org/10.1523/JNEUROSCI.2854-15.2016
Johannes Friedrich (1)
Máté Lengyel (1,2)

(1) Computational and Biological Learning Laboratory, Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, United Kingdom
(2) Department of Cognitive Science, Central European University, Budapest 1051, Hungary

Figures & Data

Figures

Figure 1.

Reinforcement learning benchmark tasks. A, Maze task (see Materials and Methods for details). B, Pendulum swing-up task (see Materials and Methods for details). C, Convergence of the dynamics toward an optimal policy representation with weights set according to the true environment. Values were computed based on spike counts up to the time indicated on the horizontal axis. Performance shows discounted average (±SEM) cumulative reward obtained by the policy based on these values, normalized such that random action selection corresponds to 0 and the optimal policy corresponds to 1. D, Learning the environmental model through synaptic plasticity. In each trial, first several randomly chosen state–action pairs were experienced and weights in the network were updated accordingly, then the dynamics of the network evolved for 1 s and its performance was measured as in C. E, Distributed representation of the continuous state space in the Pendulum task. Ellipses show 3 SD covariances of the Gaussian basis functions of individual neurons (for better visualization, only every second basis is shown along each axis). F, Activity of four representative neurons during planning. Color identifies the neurons' state-space basis functions as in E, and line style shows two different initial conditions (see inset for magnification). G, Values of the preferred states of the neurons shown in F as represented by the network over the course of its dynamics. Although both initial state values (inset) and steady-state values coincide in the two examples shown (solid vs dashed lines), the interim dynamics differ because of different neural initial conditions (F, inset). H, Policy (colored areas) and state space trajectory (gray scale circles, temporally ordered from white to black) for pendulum swing-up with preset weights. I, Values actually realized by the network. J, True optimal values for the Pendulum task.
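The performance normalization described in panel C can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and the numeric returns below are made-up placeholders.

```python
def normalized_performance(ret_policy, ret_random, ret_optimal):
    """Rescale a policy's discounted cumulative reward so that
    random action selection maps to 0 and the optimal policy to 1."""
    return (ret_policy - ret_random) / (ret_optimal - ret_random)

# Hypothetical returns: a policy halfway between random and optimal scores 0.5.
print(normalized_performance(5.0, 2.0, 8.0))  # -> 0.5
```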

Figure 2.

Two-step example task. A, The rat moving through the maze can choose the left (L) or right (R) arm at four decision points (states 0, 1, 2, and 3). Turning right in the first step (state 0) leads to a place where one of two doors opens randomly, indicated by the coin flip. The sizes of the cheeses indicate reward magnitudes (see also B). B, The decision graph corresponding to the task in A is a tree. Numerical values indicate rewards (r) and transition probabilities (p) for nondeterministic actions. C, The corresponding neural network: action nodes in B are identified with neurons (colors). Lines indicate synaptic connections, with thickness and size scaled according to their strength. A constant external input (black) signals immediate reward. Synaptic efficacies are proportional to the transition probabilities or the (expected) reward. D, Voltage traces for two neurons in C. E, Spike trains of all neurons. The color code is the same as in C. F, Activity for rate neurons with random initial values. The color code is the same as in C. The line style indicates neurons coding for optimal (solid) and suboptimal (dashed) actions. G, The approximate values Ṽ, represented by the sum of the rates in F, converge to the optimal values (black dashed lines). Values of states 0–3 are shown from bottom to top. The color code is the same as in B.
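The convergence to optimal values shown in panel G corresponds to solving the Bellman equation V(s) = max_a [r(s,a) + Σ_s' p(s'|s,a) V(s')] on the decision tree. A minimal sketch on a hypothetical tree with the same shape as B, including one stochastic "coin-flip" transition; the rewards and probabilities below are illustrative, not read from the figure.

```python
# transitions[state][action] = (immediate reward, [(prob, next state), ...]);
# "T" is an absorbing terminal state with value 0.
transitions = {
    0: {"L": (0.0, [(1.0, 1)]),
        "R": (0.0, [(0.5, 2), (0.5, 3)])},  # coin flip after turning right
    1: {"L": (1.0, [(1.0, "T")]), "R": (2.0, [(1.0, "T")])},
    2: {"L": (0.0, [(1.0, "T")]), "R": (4.0, [(1.0, "T")])},
    3: {"L": (3.0, [(1.0, "T")]), "R": (0.0, [(1.0, "T")])},
}

V = {s: 0.0 for s in transitions}
V["T"] = 0.0
for _ in range(10):  # a few synchronous sweeps suffice on a tree this shallow
    for s, actions in transitions.items():
        V[s] = max(r + sum(p * V[ns] for p, ns in succ)
                   for r, succ in actions.values())

print(V[0])  # -> 3.5: turning right (0.5*4 + 0.5*3) beats turning left (2)
```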

Figure 3.

    Time course of neural activity in a binary choice task. A, The task (top) consisting of a single state (s0) and two actions (A and B) associated with different values (which, in this case, were also their immediate rewards, rA and rB) and the corresponding neural network (bottom). B, Average population activity for offer value cells (dashed line) from the study by Padoa-Schioppa (2013) and simulation results (solid line). Trials were divided into three groups depending on the offer value (colors). C, Average population activity (dashed line) from the study by Roesch and Olson (2003) and model results (solid line). Trials were divided depending on whether the cell encoded the optimal action (blue) or not (purple) and on whether the reward was large (thick) or small (thin). The activity of the reward input used in the simulations is shown as a black curve in B and C with the corresponding y-axis plotted on the right side.

Figure 4.

    Value dependence of neural firing rates in a binary choice task in experiments (open green circles; adapted from Padoa-Schioppa and Assad, 2006) and simulations (filled blue circles). A, Neuron encoding offer value of option B. One unit of juice A was worth 2.2 units of juice B. B, Neuron encoding chosen value, 1A = 2.5B. Error bars show SEM and were often smaller than the symbols.

Figure 5.

    Psychometric and chronometric curves in a binary decision-making task. A, B, Choice probabilities in experiments (open green squares; Padoa-Schioppa and Assad, 2006) and simulations (filled blue squares) for two different relative values of the two juices: 1A = 2.2B (A) and 1A = 2.5B (B). C, Difference between the cumulative spike counts of populations representing the two potential choices in the model. Accumulation starts with sensory delay (dashed line; compare input onset in Fig. 3B). When a threshold (red line) is reached, a decision is made. Colors indicate different value ratios as in D. D, Decision time distributions in the model. Right, Dependence of raw decision times on the value ratio (colored Tukey's boxplots) and their overall distribution across all value ratios (gray histogram). Left, Normalizing function (solid blue line), together with a logarithmic fit (dashed black line), which transforms the raw decision time distribution into a standard normal distribution (gray histogram). E, Normalized reaction times (±SEM) as a function of value ratio in experiments (open green squares; Padoa-Schioppa and Assad, 2006) and simulations (filled blue squares). Lines show least squares fits (dotted green, experiments; solid blue, simulations); the inset shows distribution of residuals after fitting (green bars, experiments; blue bars, simulations).
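The decision rule in panel C, accumulating the spike-count difference between the two choice populations until it reaches a threshold, can be sketched as a Poisson race. The rates, threshold, and time step below are illustrative placeholders, not the fitted values from the simulations.

```python
import random

def decide(rate_a, rate_b, threshold=10, dt=0.001, max_t=5.0):
    """Poisson race: return (choice, decision time) once the difference in
    cumulative spike counts of the two populations reaches the threshold."""
    diff, t = 0, 0.0
    while abs(diff) < threshold and t < max_t:
        diff += random.random() < rate_a * dt  # spike from population A
        diff -= random.random() < rate_b * dt  # spike from population B
        t += dt
    return ("A" if diff > 0 else "B"), t

# A larger firing-rate (value) difference yields faster decisions on average.
choice, t = decide(rate_a=60.0, rate_b=20.0)
```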

Figure 6.

    Sequential decision making. A, An example neuron in pre-SMA showing activity modulated by the NRMs (colored lines; Sohn and Lee, 2007): amplitude decreases and delay increases with NRM. The inset shows task structure: colored circles indicate states (numbers show NRMs), arrows show state transitions (colored lines, correct action; black lines, incorrect action), and the gray square represents terminal state with reward (modeled as r = 1). B, Activity time courses of an example model neuron as a function of NRMs. The color code is the same as in A. The black line shows activity of the reward input chosen to fit experimental data. C, Activity time courses of an example model neuron as a function of the number of available actions (1 correct, others incorrect) in the state with NRM = 3. D, Experimental (open green squares; Sohn and Lee, 2007) and simulated (filled blue squares) reaction times increased approximately linearly with NRMs. Error bars (SEM) are all smaller than the symbols. E, F, Predictions of planning-as-inference for neural time courses as a function of NRMs (E) and number of available actions (F). The color code is the same as in B and C.

Figure 7.

    Predictions for a novel sequential decision-making task. A, Task structure with rewards in two distinct steps; symbols are as in Figure 6A (inset). B, Simulation results for the suggested task with added intermediate reward. The color code and activity of the reward input are as in Figure 6B. C, Reaction times (blue squares) and peak firing rates (purple circles) from the simulations in B vary nonmonotonically with NRM. Error bars (SEM) are often smaller than the symbols.

Figure 8.

    Reinforcer devaluation. A, The rat moving through the maze can turn left or right at three decision points (states 0, 1, and 2; colored numbers). The numbers above the terminal positions indicate the corresponding rewards. Devaluation decreases the reward associated with cheese (top left) from a baseline level of 4 to a devalued level of 2. [Adapted from Niv et al. (2006).] B, Simulated firing rates with baseline reward values. Colors indicate the state–action pair encoded by each cell, following the color scheme in A. The activity of the reward input (black) is as in Figure 6B. C, D, Choice probabilities (C) and reaction times (D; ±SEM) in each state. E, Activity profile for a spreading-activation model (darker means increasing activity). The path of an agent following the activity gradient (green) yields only a reward of 3 instead of the optimal 4. F–H, Same as in B–D following devaluation in our model. Note the change in the choice at the initial decision point (state 0).
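The devaluation effect in F–H follows from model-based re-planning: when the cheese reward drops from 4 to 2, recomputing values flips the best first choice. A toy sketch with a hypothetical two-level maze loosely shaped like panel A; the non-cheese rewards below are illustrative.

```python
def best_first_action(cheese_reward):
    """Best initial turn when each arm's value is the max over its outcomes."""
    v_left = max(cheese_reward, 1)   # left arm leads to cheese or a reward of 1
    v_right = max(3, 2)              # right arm leads to rewards of 3 or 2
    return "L" if v_left > v_right else "R"

print(best_first_action(4))  # baseline: cheese (4) makes the left arm best -> L
print(best_first_action(2))  # after devaluation to 2, the right arm (3) wins -> R
```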


Keywords

  • computational modeling
  • decision making
  • neuroeconomics
  • planning
  • reinforcement learning
  • spiking neurons


Copyright © 2025 by the Society for Neuroscience.
JNeurosci Online ISSN: 1529-2401
