## Abstract

During perceptual decisions, subjects often rely more strongly on early, rather than late, sensory evidence, even in tasks when both are equally informative about the correct decision. This early psychophysical weighting has been explained by an integration-to-bound decision process, in which the stimulus is ignored after the accumulated evidence reaches a certain bound, or confidence level. Here, we derive predictions about how the average temporal weighting of the evidence depends on a subject's decision confidence in this model. To test these predictions empirically, we devised a method to infer decision confidence from pupil size in 2 male monkeys performing a disparity discrimination task. Our animals' data confirmed the integration-to-bound predictions, with different internal decision bounds and different levels of correlation between pupil size and decision confidence accounting for differences between animals. However, the data were less compatible with two alternative accounts for early psychophysical weighting: attractor dynamics either within the decision area or due to feedback to sensory areas, or a feedforward account due to neuronal response adaptation. This approach also opens the door to using confidence more broadly when studying the neural basis of decision making.

**SIGNIFICANCE STATEMENT** An animal's ability to adjust decisions based on its level of confidence, sometimes referred to as “metacognition,” has generated substantial interest in neuroscience. Here, we show how measurements of pupil diameter in macaques can be used to infer their confidence. This technique opens the door to more neurophysiological studies of confidence because it eliminates the need for training on behavioral paradigms to evaluate confidence. We then use this technique to test predictions from competing explanations of why subjects in perceptual decision making often rely more strongly on early evidence: the way in which the strength of this effect should depend on a subject's decision confidence. We find that a bounded decision formation process best explains our empirical data.

- confidence
- integration-to-bound
- macaque
- perceptual decision making
- psychophysical reverse correlation
- pupillometry

## Introduction

During perceptual discrimination tasks, subjects often rely more strongly on early, rather than late, sensory evidence, even when both are equally informative about the correct decision (e.g., Kiani et al., 2008; Neri and Levi, 2008; Nienborg and Cumming, 2009; Yates et al., 2017). But some studies in rodents and humans reported uniform weighting of the stimulus throughout the trial (Brunton et al., 2013; Raposo et al., 2014; Drugowitsch et al., 2016). From the perspective of maximizing the sensory information and hence performance, such early weighting is nonoptimal. Understanding this behavior may shed light on how the activity, or the read-out of sensory neurons limits our perceptual abilities, a major goal of contemporary neuroscience (e.g., Pitkow et al., 2015; Cumming and Nienborg, 2016; Clery et al., 2017). The classical explanation for such early psychophysical weighting is that it reflects an integration-to-bound decision process in which sensory evidence is ignored once an internal decision bound is reached (Kiani et al., 2008). For simple perceptual discrimination tasks, decision confidence can be defined statistically (Hangya et al., 2016), and hence also measured for such a model. Here, we derived new predictions of this model for how the temporal weighting of sensory evidence should vary as a function of decision confidence on individual trials. These revealed characteristic differences in the temporal weighting for high- and low-confidence trials, depending on the decision bound. We then sought to test these predictions in macaques performing a fixed duration visual discrimination task while also estimating the animal's subjective decision confidence.

Measuring decision confidence psychophysically is relatively difficult, particularly in animals, and increases the complexity of a task (e.g., for post-decision wagering) (Kiani and Shadlen, 2009; Komura et al., 2013), hence requiring additional training. To avoid these difficulties, we devised a metric based on the monkeys' pupil size. Combining this metric for decision confidence with psychophysical reverse correlation (Neri et al., 1999; Nienborg and Cumming, 2007, 2009) allowed us to quantify the animals' psychophysical weighting strategy for different levels of inferred decision confidence, and test our model predictions. The animals showed clear early psychophysical weighting on average. But separating this analysis by inferred decision confidence revealed that early psychophysical weighting was largely restricted to high-confidence trials. Indeed, on low inferred confidence trials, the animals weighted the stimulus relatively uniformly or even slightly more toward the end of the trial. Such behavior matched the predictions of the integration-to-bound model. Furthermore, the differences between both animals could be accounted for by the model by differences in its only two free parameters: internal decision bound as well as the level of uncertainty in our inference of decision confidence.

The animals' behavior was not as well explained by two alternative accounts of early psychophysical weighting. The first alternative account comprises models in which the decision stage provides self-reinforcing feedback to the sensory neurons (Wimmer et al., 2015), as suggested, for example, for probabilistic inference (Haefner et al., 2016), or models with attractor dynamics within the decision-making area (Wang, 2002; Wong et al., 2007). The second, recent alternative proposal is that the early weighting simply reflects the feedforward effect of the dynamics (gain control or adaptation) of the activity of the sensory neurons (Yates et al., 2017). Although each of these alternatives predicts the early weighting, we were unable to fully capture the animals' data with the temporal weighting predictions of these models when separating trials by decision confidence.

Together, our data suggest that the animals rely on a bounded decision formation process. In this model, evidence at the end of the trial is only ignored once a certain level of decision confidence is reached, thereby reducing the impact on performance. Moreover, this combination of techniques provides a novel tool for a more fine-grained dissection of an animal's psychophysical behavior.

## Materials and Methods

##### Animal preparation and surgery.

All experimental protocols were approved by the local authorities (Regierungspräsidium Tübingen). Two adult male rhesus monkeys (Macaca mulatta), Animal A (7 kg; 11 years old) and Animal B (8 kg; 11 years old), housed in pairs, participated in the experiments. The animals were surgically implanted with a titanium headpost under general anesthesia using aseptic techniques as described previously (Seillier et al., 2017).

##### Visual discrimination task.

The animals were trained to perform a two-choice disparity discrimination task (see Fig. 2*a*). The animals initiated trials by fixating a small white fixation spot (size: 0.08°–0.12°) located at the center of the screen. After the animals maintained fixation for 500 ms, a visual stimulus was presented for 1500 ms (median eccentricity for Animal A: 5.3°, range 3.0°–9.0°; median eccentricity for Animal B: 3.0°, range 2.3°–5.0°). Two choice targets then appeared above and below the fixation spot. Each target consisted of a symbol representing either a near or a far choice, and the target positions were randomized between trials. Once the fixation spot disappeared, the animals were allowed to make a choice via saccade to one of these targets. The animals received a liquid reward for correct choices. Randomizing target positions allowed us to disentangle saccade direction and choice.

##### Visual stimuli.

Visual stimuli (luminance linearized) were back-projected on a screen using a DLP LED Propixx projector (ViewPixx; run at 100 Hz, 1920 × 1080 pixel resolution, 30 cd/m^{2} mean luminance) and an active circular polarizer (Depth Q; 200 Hz) for Animal B (viewing distance 97.5 cm), or two projection design projectors (F21 DLP; 60 Hz, 1920 × 1080 pixel resolution, 225 cd/m^{2} mean luminance, and a viewing distance of 149 cm) and passive linear polarizing filters for Animal A. The animals viewed the screen through passive circular (Animal B) or linear (Animal A) polarizing filters, matching the polarization of the respective projection setup. Stimuli were generated with custom written software using MATLAB (The MathWorks) and the Psychophysics Toolbox (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007).

The stimuli were circular dynamic random dot stereograms, which consisted of equal numbers of white and black dots, similar to those previously used (Nienborg and Cumming, 2009). Each random dot stereogram had a disparity-varying circular center (3° diameter) surrounded by an annulus (1° wide) shown at 0° disparity. On each video-frame, all center dots had the same disparity whose value was changed randomly on each video-frame according to the probability mass distribution set for the stimulus. For the 0% signal stimulus, the disparity was drawn from a uniform distribution (typically 11 values in 0.05° increments from −0.25° to 0.25°). The monkeys were rewarded randomly on half of the trials on 0% signal trials. These 0% signal trials were randomly interleaved with near disparity or far disparity signal trials. For each session, one near and one far disparity value was used to introduce disparity signal by increasing the probability of this disparity on each video-frame during the stimulus presentation on this trial. The range of signal strengths was adjusted between sessions to manipulate task difficulty and encourage performance at psychophysical threshold. Typical added signal values were 3%, 6%, 12%, 25%, and 50%.

The choice target symbols were random dot stereograms very similar to 100% signal stimuli, except that their diameter was smaller (2.2°).

To allow for constant mean luminance across the screen, equal numbers of black and white dots were used for the stimulus and the target symbols. Because we used a white fixation dot, systematic differences in fixation precision could, in principle, influence our findings. If this were the case, a black fixation marker should give the opposite results. We therefore also conducted control experiments using a black fixation marker, which yielded very similar results, indicating that systematic differences in fixation precision are insufficient to explain our findings.

##### Reward size.

To discourage the animals from guessing, the available reward size was increased based on their task performance. After 3 consecutive trials with correct choices, the available reward size was doubled compared with the original reward size. After 4 consecutive trials with correct choices, the available reward size was again doubled (quadruple compared with the original size) and remained at this size until the next error. After every error trial, the available reward size was reset to the original. For the analyses in Figure 5, “large available reward” trials refer to both intermediate and large available reward trials collapsed.
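
The reward schedule can be written as a small function of the current streak of correct choices. This is only an illustrative sketch: the function names and the convention that the streak counted before a trial determines the reward available on that trial are assumptions, not taken from the original task code.

```python
def available_reward(n_consecutive_correct, base=1.0):
    """Available reward given the streak of correct choices before this trial."""
    if n_consecutive_correct >= 4:
        return 4.0 * base  # quadruple the original size
    if n_consecutive_correct >= 3:
        return 2.0 * base  # double the original size
    return base


def reward_sequence(outcomes, base=1.0):
    """Available reward on each trial, given the outcomes (True = correct)."""
    streak, sizes = 0, []
    for correct in outcomes:
        sizes.append(available_reward(streak, base))
        # An error resets the streak, and hence the reward, to the original size.
        streak = streak + 1 if correct else 0
    return sizes
```

For example, five correct trials followed by an error give the sequence of available rewards 1, 1, 1, 2, 4, 4, and the trial after the error starts again at 1.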

##### Pupil data acquisition and analysis.

During the experiments, the animals' eye positions and pupil size were measured at 500 Hz using an infrared video-based eye tracker (Eyelink 1000, SR Research), digitized and stored for the subsequent offline analysis. The eye tracker was mounted in a fixed position on the primate chair to minimize variability in pupil size measurements between sessions. Our pupil analysis focused on the period of animals' fixation in which the gaze angles were constant. The background of the display was at mid-gray levels throughout, resulting in considerable illumination of the darkened experimental booths. To nonetheless exclude the possibility that our results were substantially affected by the dark adaptation of the pupils after the animals entered the experimental booths, we performed control analyses for which we excluded the initial 20 min of each experimental session, to allow for dark adaptation of the pupil (Hansen and Fulton, 1986), with very similar results (data not shown).

Only successfully completed trials (correct and error trials) were included for the analysis. During preprocessing, we first downsampled the pupil size data such that the sampling rate matched the refresh rates of our screens (60 Hz for Animal A, 100 Hz for Animal B), effectively low-pass filtering the data. We next high-pass filtered the data by subtracting on each trial the mean pupil size of the preceding 10 and following 10 trials (excluding the value of the current trial). This analysis removed linear trends on the pupil size within a session and was omitted for the analysis of pupil size changes throughout a session (see Fig. 3*a*). This analysis yielded qualitatively similar results to bandpass filtering (e.g., de Gee et al., 2014; Urai et al., 2017) the pupil size data. Finally, pupil size measurements were *z*-scored using the mean and SD during the stimulus presentation period across all trials.
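
The trial-wise high-pass step and the *z*-scoring can be sketched as follows. This is a minimal illustration operating on one pupil size value per trial. The function names, the handling of trials near the start and end of a session (using whichever neighbors exist), and the use of the population SD are assumptions made for the example.

```python
from statistics import mean, pstdev


def detrend_by_neighbors(trial_means, half_window=10):
    """Subtract the mean of the preceding and following trials from each trial.

    The current trial is excluded from the local mean. Near the session edges
    fewer than 2 * half_window neighbors are available (an assumption here).
    """
    n = len(trial_means)
    out = []
    for i, value in enumerate(trial_means):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        neighbors = trial_means[lo:i] + trial_means[i + 1:hi]
        out.append(value - mean(neighbors) if neighbors else 0.0)
    return out


def zscore(values):
    """Z-score values using their mean and (population) SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]
```

Because the window is symmetric, a purely linear drift across trials is removed exactly for all trials that have a full set of neighbors.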

When comparing pupil size across conditions, we aimed to minimize any mean difference of pupil size between conditions at stimulus onset. To do so, we computed a baseline pupil size, which was defined as the average pupil size in the epoch 200 ms before stimulus onset, and iteratively excluded trials in which the baseline value deviated most from the condition with the higher number of trials until the absolute mean difference of the *z* score of the baseline pupil size was ≤0.05. This procedure successfully made the baseline pupil size statistically indistinguishable across conditions with a small loss of trials in each session (for details, see Inclusion criteria).
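
One way to implement such an iterative equalization is sketched below. The text does not fully specify which trial is removed at each step, so this example assumes that the most extreme trial is removed from the condition with more trials, on the side that pulls the two means apart. The function name and this removal rule are illustrative assumptions.

```python
from statistics import mean


def equalize_baseline(cond_a, cond_b, tol=0.05):
    """Drop trials until the mean z-scored baselines differ by at most tol.

    Returns the retained baseline values for each condition (sorted).
    """
    a, b = sorted(cond_a), sorted(cond_b)
    while len(a) > 1 and len(b) > 1 and abs(mean(a) - mean(b)) > tol:
        # Remove from the condition that currently has more trials.
        larger, other = (a, b) if len(a) >= len(b) else (b, a)
        if mean(larger) > mean(other):
            larger.pop()    # drop its largest baseline value
        else:
            larger.pop(0)   # drop its smallest baseline value
    return a, b
```

The loop always terminates because every iteration removes one trial, and it stops early once the tolerance is met.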

##### Psychometric threshold.

The animals' choice-behaviors were summarized as a psychometric function by plotting the percentage of “far” choices as a function of the signed disparity signals and then fitted with a cumulative Gaussian function using maximum likelihood estimation (see Fig. 2*b*). The SD of the cumulative Gaussian fit was defined as the psychophysical threshold and corresponds to the 84% correct level. The mean of the cumulative Gaussian quantified the subject's bias.
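
A maximum likelihood fit of a cumulative Gaussian can be sketched as below. For brevity, the example maximizes the binomial likelihood by a coarse grid search over candidate values, which is a stand-in chosen for this illustration and not the optimizer used for the published fits. All function names are hypothetical.

```python
import math


def cum_gauss(x, mu, sigma):
    """Cumulative Gaussian: probability of a 'far' choice at signed signal x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def neg_log_likelihood(params, signals, n_far, n_total):
    """Binomial negative log-likelihood of the observed 'far' choice counts."""
    mu, sigma = params
    nll = 0.0
    for x, k, n in zip(signals, n_far, n_total):
        p = min(max(cum_gauss(x, mu, sigma), 1e-9), 1.0 - 1e-9)
        nll -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return nll


def fit_psychometric(signals, n_far, n_total, mus, sigmas):
    """Return the (bias, threshold) pair on the grid with the highest likelihood."""
    grid = [(m, s) for m in mus for s in sigmas]
    return min(grid, key=lambda p: neg_log_likelihood(p, signals, n_far, n_total))
```

With no bias, a signal equal to the fitted SD yields `cum_gauss(sigma, 0, sigma)` of about 0.84, which is why the SD corresponds to the 84% correct level.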

##### Psychophysical kernel.

Psychophysical kernels were computed to quantify how the animals used the stimulus for their choices (Nienborg and Cumming, 2007, 2009). Only 0% signal trials were used for this analysis. First, the stimulus was converted into an *n* × *m* matrix (*n*, number of discrete disparity values used for the stimulus; *m*, number of trials) that contained the number of video-frames on which each disparity was presented per trial. Next, the trials were divided into “far” choice and “near” choice trials. The time-averaged psychophysical kernel was then computed as the difference between the mean matrix for “near” choice trials and that for “far” choice trials. We also computed a time-resolved psychophysical kernel as the psychophysical kernels for four nonoverlapping consecutive time bins (each of 375 ms duration) during the stimulus presentation period. Kernels were averaged across sessions, weighted by the number of trials in that session. The amplitude of the psychophysical kernels (PKAs) over time was calculated as the inner product between the time-averaged psychophysical kernel and the psychophysical kernel for each time bin. Kernel amplitudes separated by inferred decision confidence were then normalized by the maximum of the psychophysical kernel averaged across both conditions such that the relative differences between conditions remained. The SE of the amplitude was calculated by bootstrapping (1000 repeats). We also verified that our results were qualitatively similar when psychophysical kernels were computed using logistic regression (compare Yates et al., 2017).
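
The kernel computation for a single session can be sketched as follows. In this illustration each trial is represented as a list of per-frame disparity indices, which is an assumed data layout, and all function names are hypothetical. Session averaging, normalization, and bootstrapping are omitted.

```python
def mean_counts(trials, n_values):
    """Mean number of frames on which each disparity value was shown."""
    total = [0.0] * n_values
    for frames in trials:
        for idx in frames:
            total[idx] += 1.0
    return [t / len(trials) for t in total]


def kernel(trials, choices, n_values):
    """Difference of mean disparity counts: 'near' minus 'far' choice trials."""
    near = [f for f, c in zip(trials, choices) if c == "near"]
    far = [f for f, c in zip(trials, choices) if c == "far"]
    m_near, m_far = mean_counts(near, n_values), mean_counts(far, n_values)
    return [a - b for a, b in zip(m_near, m_far)]


def kernel_amplitudes(trials, choices, n_values, n_bins=4):
    """PKA per time bin: inner product of time-averaged and per-bin kernels."""
    k_avg = kernel(trials, choices, n_values)
    n_frames = len(trials[0])
    edges = [round(b * n_frames / n_bins) for b in range(n_bins + 1)]
    pka = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        k_bin = kernel([f[lo:hi] for f in trials], choices, n_values)
        pka.append(sum(a * b for a, b in zip(k_avg, k_bin)))
    return pka
```

Projecting each time-bin kernel onto the time-averaged kernel reduces the time course to one amplitude per bin, so a decreasing sequence of amplitudes indicates early weighting.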

##### Operationalizing decision confidence.

When viewed from a statistical perspective, decision confidence can be linked to several behavioral metrics, such as accuracy, discriminability, and choices on error or correct trials (Hangya et al., 2016) (see Fig. 5*b*). Here, we simulated an observer's decision variables on each trial analogously to Urai et al. (2017). The decision variable (*d*) was drawn from a normal distribution whose mean depended on the signed signal strength (with negative and positive signal reflecting near and far stimuli, respectively) and whose SD (σ) depended on the observer's internal noise (22.8% signal, the median of the animals' psychophysical thresholds). The sign of *d* determined the choice on each trial. Assuming a category boundary *c* = 0 (no bias), trial-by-trial confidence (the distance between the decision variable and the category boundary) was transformed into an average across trials percent correct (Lak et al., 2014) as follows:

$$\text{confidence} = f\left(|d - c|\right)$$

where *f* is the cumulative distribution function of the normal distribution as follows:

$$f(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sigma\sqrt{2}}\right)\right]$$

To simulate the relationship between accuracy and confidence, we generated the *d* for 10^{8} trials, binned these based on the level of confidence (20 bins), and computed the accuracy for each bin. To examine the relationship between confidence and psychophysical performance, we performed a median split of the trials based on confidence and measured the psychometric function for high- and low-confidence trials. Finally, we calculated the mean confidence as a function of signal strength separately for correct and error trials.
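
A scaled-down version of this simulation can be sketched as below. It assumes that confidence is the normal cumulative distribution function of |*d* − *c*| scaled by the internal noise SD, uses far fewer trials than the 10^{8} described above, and reports only the median-split accuracies. The signal values, seed, and function names are illustrative assumptions.

```python
import math
import random

SIGMA = 22.8  # internal noise SD in % signal (median threshold from the text)


def confidence(d, c=0.0, sigma=SIGMA):
    """Map the distance |d - c| to an expected proportion correct."""
    return 0.5 * (1.0 + math.erf(abs(d - c) / (sigma * math.sqrt(2.0))))


def simulate(signals, n_trials, seed=0):
    """Simulate (signal, confidence, correct) for randomly interleaved signals."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        s = rng.choice(signals)
        d = rng.gauss(s, SIGMA)          # noisy decision variable
        correct = (d > 0) == (s > 0)     # choice is the sign of d
        trials.append((s, confidence(d), correct))
    return trials


def median_split_accuracy(trials):
    """Accuracy for the low- and high-confidence halves of the trials."""
    ranked = sorted(trials, key=lambda t: t[1])
    half = len(ranked) // 2
    low, high = ranked[:half], ranked[half:]
    return (sum(t[2] for t in low) / len(low),
            sum(t[2] for t in high) / len(high))
```

In such a simulation, accuracy is higher for the high-confidence half than for the low-confidence half, which is the signature used to validate a candidate confidence metric.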

##### Perceptual decision models.

To compare the animals' psychophysical kernels to different decision strategies, we simulated different perceptual decision models and calculated psychophysical kernels for the model data. For all simulations, only 0% signal trials were used, and the model “decision confidence” was defined as |decision variable| at the end of each trial, unless stated otherwise. PKAs were then computed separately for high- and low-confidence trials, after a median split based on this metric for decision confidence. To account for the imperfect relationship between pupil size and decision confidence, we allowed for noise (“confidence noise,” Gaussian additive noise ∼*N*(*0*, σ^{2}), where σ was scaled by the SD of the noise-free distribution of the confidence values) when assigning trials into the high- or low-confidence groups and fitting the model PKAs separated by confidence to the animals' data (compare results in Fig. 7). For this fitting procedure, we minimized the mean squared error using MATLAB fminsearchbnd. To compare the model performance, we used the Akaike Information Criterion (AIC) and normalized mean squared error (where the difference between model prediction and data point is divided by the SE of the respective data point).
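
The noisy assignment of trials to confidence groups can be illustrated with a short sketch. Here `noise_level` plays the role of σ expressed in units of the SD of the noise-free confidence values. The function name and the use of a strict comparison against the median are assumptions of this example.

```python
import random
from statistics import median, pstdev


def noisy_median_split(confidences, noise_level, seed=0):
    """Label trials as high confidence (True) after adding confidence noise.

    The noise SD equals noise_level times the SD of the noise-free values.
    """
    rng = random.Random(seed)
    sd = noise_level * pstdev(confidences)
    noisy = [c + rng.gauss(0.0, sd) for c in confidences]
    cut = median(noisy)
    return [x > cut for x in noisy]
```

With `noise_level = 0` the split is identical to the noise-free median split, and increasing the noise progressively mixes truly high- and low-confidence trials across the two groups.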

##### Integration-to-bound model.

In this model, the decision variable (*d*) is computed as the integrated time-varying difference of the population response of two pools of sensory neurons. For the disparity discrimination task, these would consist of one pool preferring near disparities, the other preferring far disparities. We computed the time-varying population response as the dot product between the time-varying stimulus (analogous to that used in the experiments) and an idealized version of the animals' time-averaged psychophysical kernel. On each trial, once the decision variable reached a decision bound (at decision time, *t*) (Mazurek et al., 2003; Kiani et al., 2008), the decision variable was fixed at that value (absorbing bound) until the end of the trial. The choice of the model was based on sign (*d*) at the end of the trial. We used two approaches to derive decision confidence for this model. First, it was defined as |*d*| at the end of the trial. This approach ignores the decision time. This model had one free parameter (the height of the decision bound), which we varied to best account for the time courses of the PKAs for low- and high-confidence trials. In this model, all trials in which the decision bound was reached are assigned the same confidence. Second, we also generated predictions for the proposal that subjective confidence is higher for those trials in which the bound is reached earlier (Kiani and Shadlen, 2009; Kiani et al., 2014). Because our analysis only relied on the rank order of the trials based on confidence, our results are independent of how exactly this time is converted into confidence.
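
A stripped-down simulation of the first confidence definition (|*d*| at the end of the trial) is sketched below. It simplifies the procedure described above in two ways that are assumptions of this example: the momentary evidence is drawn directly from a unit-variance Gaussian instead of being computed from a disparity stimulus and a kernel, and the PKA is approximated by the choice-signed mean evidence per time bin. Function names and default parameters are illustrative.

```python
import random


def simulate_trial(rng, n_frames, bound):
    """Return the per-frame evidence and the final decision variable."""
    evidence, d, absorbed = [], 0.0, False
    for _ in range(n_frames):
        e = rng.gauss(0.0, 1.0)   # zero-signal momentary evidence
        evidence.append(e)
        if not absorbed:
            d += e
            if abs(d) >= bound:   # absorbing bound: later evidence is ignored
                d = bound if d > 0 else -bound
                absorbed = True
    return evidence, d


def binned_pka(trials, n_bins=4):
    """Choice-signed mean evidence in consecutive time bins."""
    width = len(trials[0][0]) // n_bins
    pka = []
    for b in range(n_bins):
        total = 0.0
        for evidence, d in trials:
            sign = 1.0 if d > 0 else -1.0
            total += sign * sum(evidence[b * width:(b + 1) * width]) / width
        pka.append(total / len(trials))
    return pka


def pka_by_confidence(n_trials=10000, n_frames=40, bound=6.0, seed=1):
    """PKA for all trials and for low- and high-confidence halves."""
    rng = random.Random(seed)
    trials = [simulate_trial(rng, n_frames, bound) for _ in range(n_trials)]
    ranked = sorted(trials, key=lambda t: abs(t[1]))  # confidence = |d|
    half = n_trials // 2
    return binned_pka(trials), binned_pka(ranked[:half]), binned_pka(ranked[half:])
```

Because all trials that reach the bound share the same |*d*|, they are tied in the ranking, which mirrors the statement that such trials are assigned the same confidence.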

##### Neural sampling-based probabilistic inference model.

We used the model by Haefner et al. (2016), implemented for an orientation discrimination task. In this model, the decision is based on a belief over the correct decision (posterior probability over the correct decision), which is updated throughout each trial. The decision confidence was computed as |posterior probability −0.5|, which effectively reflects the distance of the posterior to the category boundary. To approximate the time course of the PKA for high- and low-confidence trials, we varied the strength of the feedback in the model, the contrast of the orientation-selective component of the stimulus and the trial duration. The parameters used to generate the sampling model predictions were the same as in the original paper (κ = 2, λ = 3, δ = 0.08, *n*_{s} = 20; stimulus contrast on each individual frame = 10) (Haefner et al., 2016) and only differed in the number of sensory neurons (*n*_{x} = 256, *n*_{g} = 64) to reduce computation time. The decreasing PKA in this model is the result of a feedback loop between the decision making area and the sensory representation.

##### Evidence accumulation toy-model (idealized attractor model).

To be able to systematically explore the predictions of attractor-based models for confidence-specific PKAs, we devised two simple abstract models. In the first, the decision variable *d*_{t} at time *t* is defined as follows:

$$d_t = (1 + \alpha)\, d_{t-1} + \mu_t$$

where μ_{t} is the sensory evidence at time *t*, and α is an acceleration parameter of the accumulation process (compare Brunton et al., 2013). For α = 0, the model performs perfect integration. For α < 0, it is a leaky integrator; and for α > 0, the model implements a confirmation bias or attractor. In the second model, a variant of the previous one, the acceleration parameter α depends on a sigmoidal function of *d*. For α > 0, the behavior of *d*_{t} can then be described by an attractor with a double-well energy landscape in which the minimum of each well corresponds to a choice attractor (compare Wimmer et al., 2015), a behavior also observed for the sampling model by Haefner et al. (2016).
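
The effect of the acceleration parameter on temporal weighting can be made explicit with a short sketch of the first toy model only. It assumes that the update takes the linear form in which the previous decision variable is multiplied by (1 + α) before the new evidence is added. Under that assumption, the evidence at each time step enters the final decision variable with a weight that is a power of (1 + α). Function names are illustrative.

```python
def accumulate(evidence, alpha):
    """Final decision variable for the assumed update d = (1 + alpha) * d + mu."""
    d = 0.0
    for mu in evidence:
        d = (1.0 + alpha) * d + mu
    return d


def temporal_weights(n_steps, alpha):
    """Weight of the evidence at each time step in the final decision variable."""
    return [(1.0 + alpha) ** (n_steps - 1 - i) for i in range(n_steps)]
```

For α > 0 the weights decrease over time (early evidence dominates), for α = 0 they are uniform, and for α < 0 they increase over time (late evidence dominates).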

##### Early sensory weighting model after Yates et al. (2017).

We simulated psychophysical model decisions based on sensory responses of a linear-nonlinear model. The linear stage consisted of two temporal filters (*k*, one for contrast, one for disparity). The time-varying disparity stimulus and the stimulus contrast were each convolved with the corresponding temporal filter, and their sum (*x*(*t*)) was exponentiated to generate spike rates as follows:

$$r(t) = \exp\left(x(t)\right)$$

The model parameters *a*, *b*, *t*_{max}, τ as well as the relative weights of the disparity and contrast kernels were chosen such that the dynamics of the output of the linear-nonlinear model approximately matched that of the average peristimulus time histogram of neurons in area MT (Yates et al., 2017, their Fig. 3*b*). Starting from these initial values, we then varied these model parameters to explore a range of adaptation levels (as shown in Fig. 8). To simulate the decision process, we used two of these MT responses but with opposite tuning, and computed the decision variable (*d*(*t*)) as the integral of the difference of these time-varying MT responses. The decision on each trial was based on sign (*d*(*t*)) at the end of the trial, and decision confidence was defined as |*d*| at the end of the trial.

We further explored an extension of this model to additionally account for the temporal autocorrelation of the spiking response, also after Yates et al. (2017). This variant was identical to the first, except that (1) we generated spikes based on the spike rates using a Poisson process; and (2) we included a spike history term, in which *h* (“history filter,” as in Yates et al., 2017, their Supplementary Math Note Fig. 1*c*) is the postspike weight that integrates the neuron's own spiking history (*r*(*t*−1)). This extension yielded similar results to the version without spike history (data not shown).

##### Inclusion criteria.

Data from a total of 436 sessions (300 and 136 sessions from Animal A and B, respectively) were included. Trials with fixation errors were excluded, thereby reducing the number of included trials from 874,641 to 590,050 successfully completed trials (Animal A: 409,597 trials; Animal B: 180,453 trials). Additionally, to ensure that the differences in pupil size modulation across conditions were not simply a consequence of systematic differences in the baseline pupil size across conditions, we equalized baseline pupil size between conditions by iteratively removing trials until the mean difference of the *z*-scored baseline pupil size values between conditions was ≤0.05. This baseline equalization was done separately for the following conditions. (1) To explore the effect of signal strength (see Fig. 3*c*), signal trials were divided into easy (≥50% signal) and hard (>0% and <10% signal) trials, and the baseline equalized between these conditions, thereby removing 2457 trials from Animal A and 409 trials from Animal B. (2) To compute PKAs (see Figs. 2*c*, 7), and to explore the effect of available reward size on pupil size modulation (see Fig. 3*b*), only no-signal (0% signal) trials were used. To ensure that our metric used to infer decision confidence (mean pupil size during the last 250 ms before stimulus offset) and the pupil size modulation for available reward size did not merely reflect differences in baseline pupil size across conditions, we first separated trials into two groups: small and large (including both intermediate and large) available reward trials. Within each reward-size group, we divided trials according to our pupil size metric (median split) into two subgroups and equalized baseline across these subgroups. In a second step, we equalized baseline across the two reward-size groups. This two-step procedure removed 7237 trials from Animal A and 2478 trials from Animal B.
Additionally, we only included sessions in which the trials per session remaining after baseline correction exceeded 600, and in which each experimental condition had at least 10 trials. For each session, three psychometric functions were computed (one using all the completed trials, one each including only trials for which the available reward size was large or small, respectively). We fitted cumulative Gaussians to each of these psychometric functions, and only sessions for which each of these fits explained >90% of the variance were included. This selection resulted in 213 sessions from Animal A (312,998 trials) and 84 sessions from Animal B (122,897 trials) that were included for analysis. For our analyses based on inferred decision confidence (see Figs. 5, 7), we only used the last 40 sessions for Animal B after sufficient learning (compare Fig. 4). In control analyses, we verified that all our results were similar when instead no inclusion criteria were applied and all 590,050 successfully completed trials were used.

##### Data and code availability.

The code to reproduce the figures is available online at https://github.com/NienborgLab/Kawaguchi_et_al_2018, and the data at https://figshare.com/articles/Kawaguchi_et_al_2018/7076621.

## Results

### Integration-to-bound models predict characteristic differences in temporal sensory weighting when separating trials by decision confidence

Subjects during psychophysical discrimination tasks often give more weight to the early than late part of the stimulus presentation, even in tasks when both are equally informative about the correct answer (Kiani et al., 2008; Nienborg and Cumming, 2009; Yates et al., 2017). We refer to this behavior as early psychophysical weighting, and the standard computational account is that it reflects an integration-to-bound decision process (Kiani et al., 2008). In brief, this explanation suggests that subjects accumulate sensory evidence only up to a predefined bound, not only in reaction time tasks but also in tasks when the stimulus duration is fixed by the experimenter, and when a complete accumulation of evidence over the course of the entire trial would be optimal. As a result, sensory evidence is ignored after the internal bound is reached on a given trial and, together with a variable time at which this bound is reached, on average across trials, early evidence is weighted more strongly than evidence presented late in a trial. If this explanation for the observed early weighting is correct, then across trials in which the decision variable never reaches the bound, all evidence would be weighted equally, regardless of when it was presented during the trial.

Interestingly, for simple perceptual discrimination tasks, decision confidence can be defined statistically (Hangya et al., 2016) and directly linked to the decision variable. In an integration-to-bound model, it simply reflects the distance of the decision variable to the category boundary. Here, we exploited this link and systematically explored how the temporal weighting of the sensory stimulus should depend on decision confidence according to the integration-to-bound model. To do so, we categorized trials into high- or low-confidence trials (median split) and measured the temporal weighting of the sensory evidence as the PKA over time (see Materials and Methods) for each category. We compared these for high-confidence trials, low-confidence trials, and across all trials while systematically varying the decision bound of the model (Fig. 1). As expected, we found that the average PKA decreases more steeply if the decision bound is lower (Fig. 1*a–e*, black lines), indicating that the decision bound was reached earlier on average, and therefore the sensory evidence ignored from an earlier point in the trial. It is also intuitive that the PKA was typically larger for high- compared with low-confidence trials, reflecting the stronger sensory evidence, and hence confidence, on those trials. If the decision bound is low, the decision bound is reached on a large proportion of trials, and the assigned decision confidence identical. These trials are therefore randomly assigned to the high- and low-confidence category, resulting in the similarity of the PKAs (Fig. 1*a*). However, an interesting, nontrivial characteristic emerges for intermediate values of the decision bound (Fig. 1*b*,*c*). Relatively strong evidence early during the trial led to high-confidence and early reaching of the decision boundary, resulting in the pronounced decrease of the PKA for high-confidence trials. 
But for low-confidence trials, the PKA showed not a decrease but an increase over time (Fig. 1*b–d*). As a result, the PKAs for high- and low-confidence trials crossed and the PKA for low-confidence trials exceeded that for high-confidence trials at the end of the stimulus presentation. Over a range of values of the decision bound, the difference between the PKA for high- and low-confidence trials was therefore negative (Fig. 1*f*). This characteristic behavior was even more pronounced when we defined decision confidence not only based on evidence but also decision time, as previously suggested (Kiani and Shadlen, 2009; Kiani et al., 2014) (compare Fig. 1*g–l*). A two-race extension of the bounded integration model as used in van den Berg et al. (2016) produced similar results. Because our analysis depended only on the rank order of the decision confidence, these results hold generally, regardless of the relative weighting of time and evidence for decision confidence (see Materials and Methods). After sorting zero-signal trials by decision variable, the PKA cannot easily be interpreted as a weight on the stimulus. For instance, the temporal weights on any one trial are always a non-zero constant starting at the beginning of the trial, and zero after some point. As a result, the averaged weights across all trials must be decreasing. The fact that the PKA may be increasing is the result of sorting the trials by confidence, which separates the stimulus distributions between the high- and the low-signal trials. Equally, the more pronounced early difference in PKAs for low decision bounds (compare Fig. 1*a* with Fig. 1*g*) reflects the fact that, when decision confidence is based on both time and evidence, trials with stronger early sensory evidence, and hence early decision times, are assigned to the high-confidence category.
Nonetheless, these simulations reveal characteristic predictions about how a particular statistic (the psychophysical kernel as measured by taking the difference between the choice-triggered averages) should vary as a function of confidence for a bounded decision formation process. We therefore next aimed to test these predictions in monkeys performing a visual discrimination task for which early psychophysical weighting was previously reported (Nienborg and Cumming, 2009).
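The qualitative pattern described above can be illustrated with a minimal simulation. This is a sketch under simplifying assumptions, not the model code used for Figure 1: the stimulus is reduced to one Gaussian sample per time bin on zero-signal trials, the kernel is the difference between choice-triggered stimulus averages, confidence is based on evidence only, and all function names and parameter values (`kernels_by_confidence`, 20 time bins, a bound of 3) are hypothetical.

```python
import numpy as np

def simulate_bounded_integration(n_trials, n_steps, bound, rng):
    """Zero-signal trials: integrate Gaussian evidence until |sum| >= bound."""
    stim = rng.standard_normal((n_trials, n_steps))
    cum = np.cumsum(stim, axis=1)
    hit = np.abs(cum) >= bound
    reached = hit.any(axis=1)
    # Index of the first bound crossing, or the last step if never reached.
    stop = np.where(reached, hit.argmax(axis=1), n_steps - 1)
    dv = cum[np.arange(n_trials), stop]
    choice = np.where(dv >= 0, 1, -1)
    # Evidence-only confidence: distance to the category boundary,
    # capped at the bound so that all bound-reaching trials tie.
    confidence = np.minimum(np.abs(dv), bound)
    return stim, choice, confidence

def pka(stim, choice):
    """Kernel as the difference between choice-triggered stimulus averages."""
    return stim[choice == 1].mean(axis=0) - stim[choice == -1].mean(axis=0)

def median_split(confidence, rng):
    """Boolean mask of high-confidence trials; ties are broken at random."""
    order = rng.permutation(confidence.size)
    sorted_idx = order[np.argsort(confidence[order], kind="stable")]
    ranks = np.empty(confidence.size, dtype=int)
    ranks[sorted_idx] = np.arange(confidence.size)
    return ranks >= confidence.size // 2

def kernels_by_confidence(bound, n_trials=40000, n_steps=20, seed=0):
    rng = np.random.default_rng(seed)
    stim, choice, conf = simulate_bounded_integration(n_trials, n_steps, bound, rng)
    high = median_split(conf, rng)
    return (pka(stim, choice),
            pka(stim[high], choice[high]),
            pka(stim[~high], choice[~high]))

if __name__ == "__main__":
    pka_all, pka_high, pka_low = kernels_by_confidence(bound=3.0)
    print("all trials, first/last bin:", pka_all[0], pka_all[-1])
    print("high-confidence, last bin: ", pka_high[-1])
    print("low-confidence, last bin:  ", pka_low[-1])
```

Varying `bound` in this sketch is one way to explore how the kernels for the two confidence categories move relative to each other; the random tie-breaking in `median_split` mirrors the random assignment of bound-reaching trials described above.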

### The animals exhibit early psychophysical weighting behavior in this task

Two macaque monkeys performed a coarse disparity discrimination task (Fig. 2*a*), similar to that described previously (Nienborg and Cumming, 2009). The animals initiated each trial by fixating on a small fixation marker; after a delay of 500 ms, a dynamic random dot stimulus was presented for a fixed duration of 1500 ms. The stimulus was a circular random dot pattern defining a central disk and a surrounding annulus. The animals' task was to determine whether the disparity-varying center was protruding (“near”) or receding (“far”) relative to the surrounding annulus. Following the stimulus presentation, two choice targets appeared above and below the fixation point: one symbolizing a “near” choice, the other a “far” choice. Importantly, the positions of the choice targets were randomized between trials such that the animals' choices were independent of their saccade direction. While the animals performed this task, we measured their eye positions and pupil size.

Our animals performed the task well (Fig. 2*b*). Similar to previous findings (e.g., Kiani et al., 2008; Nienborg and Cumming, 2009; Yates et al., 2017), the animals relied more strongly on the stimulus early than late during the stimulus presentation. We quantified this as a decrease in the PKA (see Materials and Methods) throughout the stimulus presentation (Fig. 2*c*). To test the model predictions separated by decision confidence in the animals' data, we therefore sought to devise an approach to infer the animals' decision confidence from pupil size measurements in this task.

### Pupil size is systematically associated with experimental covariates, consistent with pupil-linked changes in arousal

Pupil size has been linked to a subject's arousal in both humans (Bradley et al., 2008) and monkeys (Rudebeck et al., 2014; Ebitz and Platt, 2015; Suzuki et al., 2016; Mitz et al., 2017). Our animals performed a substantial number of trials in each session (mean; Animal A: 828; Animal B: 1067). We therefore wondered whether a signature of their decreasing motivation with increased satiation during the behavioral session could be found in the animals' pupil sizes. To this end, we split the trials of each session into five equally sized bins (quintiles) and computed the average pupil size aligned on stimulus onset (Fig. 3*a*). For these averages, only 0% signal trials on which the available reward size was small (see Materials and Methods) were used. Moreover, to allow for the detection of slow trends throughout the session, the pupil size data were not high-pass filtered for this analysis. We found that, in both animals, pupil size systematically decreased throughout the session, as expected for a decrease in arousal with decreased motivation or task engagement with progressive satiation.
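The quintile analysis can be sketched as follows. This is an illustration only: it assumes the pupil traces of one session are stored as a trials-by-time array in the order in which the trials were performed, already restricted to the 0% signal, small-reward trials; the function name `quintile_means` is hypothetical.

```python
import numpy as np

def quintile_means(pupil, n_bins=5):
    """Average pupil time course within consecutive, equally sized trial bins.

    pupil: array of shape (n_trials, n_time), rows in within-session order.
    Returns an array of shape (n_bins, n_time).
    """
    pupil = np.asarray(pupil, dtype=float)
    # array_split keeps the bins as equal as possible when n_trials
    # is not a multiple of n_bins.
    bins = np.array_split(np.arange(pupil.shape[0]), n_bins)
    return np.stack([pupil[idx].mean(axis=0) for idx in bins])

if __name__ == "__main__":
    # Synthetic session whose pupil size declines linearly across trials.
    session = np.linspace(1.0, 0.0, 100)[:, None] * np.ones((1, 10))
    print(quintile_means(session)[:, 0])
```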

We next explored the effect of varying the available reward size in a predictable way (see Materials and Methods). Consistent with previous results (Cicmil et al., 2015), the animals' psychophysical performance on large available reward trials exceeded that on small available reward trials (Fig. 3*d*). When averaging the time course of the pupil size for 0% signal trials separated by available reward size, we found that pupil size for large available reward trials increased progressively compared with that on small available reward trials (Fig. 3*b*). The animals were rewarded after correct choices following the stimulus presentation. The time course of this pupil size modulation with available reward size is therefore consistent with modulation related to the animals' expectation of the reward toward the end of the trial. Indeed, the difference in mean pupil size with available reward size over the last 250 ms of the stimulus presentation was highly statistically reliable (Fig. 3*e*), similar to previous findings (Baruni et al., 2015).

Previous studies that revealed arousal-linked pupil size modulation typically used long intertrial intervals (ITIs) lasting several seconds (Rudebeck et al., 2014; Ebitz and Platt, 2015; Suzuki et al., 2016; Mitz et al., 2017), which were deemed necessary to stabilize pupil size before stimulus or trial onset. Conversely, our task allowed for short ITIs (Animal A: 65–4772 ms, median: 136 ms; Animal B: 115–3933 ms, median: 146 ms) to yield a large number of trials per session. Nonetheless, the above analyses revealed robust signatures of pupil size modulation with experimental manipulations of arousal also for this task.

Given the relatively short ITIs and the sluggishness of the pupil response, we performed a number of control analyses to verify that these results did not merely reflect effects from the preceding trials. First, we equalized the baseline pupil size before stimulus onset across conditions (see Materials and Methods). Second, we explored the effect of the preceding saccade direction, ITI, and the mean pupil size during the last 250 ms of the stimulus presentation on the preceding trial. While there was no systematic difference in pupil size as a function of the saccade direction on the preceding trial (*p* = 0.75 and *p* = 0.92 for Monkey A and Monkey B, respectively), there was a systematic correlation between ITI and mean pupil size in 1 animal (*p* = 0.50 and *p* = 0.002, for Monkey A and Monkey B, respectively), and between the mean pupil size on the preceding and current trial in both animals (*p* < 10^{−6} and *p* < 10^{−7}, for Monkey A and Monkey B, respectively; for the distribution of Spearman's correlation coefficients across sessions). We therefore performed additional analyses to verify that the effects of available reward size and task difficulty were found, even when ITI or mean pupil size on the preceding trial was matched. To this end, we divided trials into five groups (quintiles) according to ITI or mean pupil size on the preceding trial, repeated the analysis (Fig. 3*b*,*c*) for each quintile, and found that the main characteristics of the pupil size modulation were robust.

Previous work in humans found that pupil size increased with task difficulty, which is thought to reflect changes in arousal related to “cognitive load” or “mental effort” (Hess and Polt, 1964; Kahneman and Beatty, 1966; Alnæs et al., 2014). To explore whether such a signature was evident for our task, we divided our data into easy (≥50% signal) and hard trials (<10% signal, excluding 0% signal trials) (Fig. 3*c*). To remove effects of available reward size, this analysis was restricted to small available reward trials. Consistent with the expected modulation for cognitive load, pupil size in hard trials weakly exceeded that for easy trials in the initial period of the stimulus presentation (before ∼750 ms after stimulus onset). However, the more pronounced modulation with task difficulty occurred in the opposite direction toward the end of the trial.

Remarkably, plotting this modulation across training sessions revealed that this late modulation only emerged once the animals knew the task well (Fig. 4*a*) and was correlated with task performance (Fig. 4*b*). This late modulation appears to reflect the animals' expectation to receive a reward based on their knowledge of the probability of being correct given the stimulus difficulty. It might thus be interpretable as a modulation based on the animal's confidence to make the correct decision. We will show next that this modulation indeed exhibits established key signatures (Hangya et al., 2016; Urai et al., 2017) of decision confidence, supporting this interpretation.

### Pupil size in this task can be used to infer the animal's decision confidence

For a two-alternative sensory discrimination task analogous to the one used here, decision confidence is monotonically related to the distance to a category boundary (Kepecs et al., 2008; Hangya et al., 2016), that is, the integrated sensory evidence, as schematically shown in Figure 5*a*. From a statistical perspective, decision confidence in such discrimination tasks should be systematically associated with evidence discriminability, accuracy, and choice outcome (model predictions in Fig. 5*b*). Empirically, we found that mean pupil size during the 250 ms before stimulus offset showed the three characteristics of statistical decision confidence keeping reward size constant (Fig. 5*c*): we restricted these analyses to small available reward trials to eliminate the effect of available reward size. The findings were qualitatively the same when only analyzing large available reward trials (Fig. 5*d*). First, in both animals, pupil size was correlated with performance accuracy (Fig. 5*c*,*d*, left column; *p* < 10^{−4} and *p* < 0.01 for Animal A and B, respectively, Spearman's rank correlation). Second, when separating trials based on pupil size (median split), the animals showed better discrimination performance for trials on which pupil size was larger, as expected for improved evidence discrimination with higher decision confidence (Hangya et al., 2016) (Fig. 5*c*,*d*, middle column; *p* < 10^{−3} and *p* = 0.014 for Animal A and B, respectively, by resampling). Third, as predicted, when separating correct and error trials, decision confidence increased on correct and decreased on error trials (Fig. 5*c*,*d*, right column; *p* < 10^{−5} and *p* < 0.01 for Animal A and *p* < 0.001 and *p* < 0.05 for Animal B in Fig. 5*c* and Fig. 5*d*, respectively; Spearman's rank correlation with the model predictions in Fig. 5*b*). Interestingly, we also observe a slight increase in pupil size with signal strength for higher signal strengths in Animal B. 
Such a pattern is expected if decision confidence is informed not only by the strength of the sensory evidence, as described above, but also by decision time, as observed in human observers (Kiani et al., 2014; see also Adler and Ma, 2017, for how increasing confidence ratings may actually be compatible with Bayesian confidence). Indeed, fits of the model by Kiani et al. (2014) correlated well with the data (*p* < 10^{−4} and *p* < 0.01 for Animal A, and *p* < 10^{−6} and *p* < 0.01 for Animal B in Fig. 5*c* and Fig. 5*d*, respectively; Spearman's rank correlation).
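The three statistical signatures of evidence-based confidence can be reproduced in a toy signal detection setting. The sketch below is not the model used for Figure 5*b*; it assumes a single noisy percept per trial with unit-variance Gaussian noise, a category boundary at zero, and confidence defined as the absolute distance of the percept from that boundary. The function name and signal values are hypothetical.

```python
import numpy as np

def confidence_signatures(signals, n_trials=50000, seed=0):
    """Toy signal detection model of statistical decision confidence.

    Returns one row per signal strength with
    (accuracy, mean confidence on correct trials, mean confidence on errors),
    plus accuracy for the high- and low-confidence halves of all trials.
    """
    rng = np.random.default_rng(seed)
    per_signal, all_conf, all_correct = [], [], []
    for mu in signals:
        percept = mu + rng.standard_normal(n_trials)  # noisy internal evidence
        correct = percept > 0                         # true category is positive
        conf = np.abs(percept)                        # distance to the boundary
        per_signal.append((correct.mean(),
                           conf[correct].mean(),
                           conf[~correct].mean()))
        all_conf.append(conf)
        all_correct.append(correct)
    conf = np.concatenate(all_conf)
    correct = np.concatenate(all_correct)
    high = conf > np.median(conf)                     # median split on confidence
    return np.array(per_signal), correct[high].mean(), correct[~high].mean()

if __name__ == "__main__":
    table, acc_high, acc_low = confidence_signatures([0.5, 1.0, 2.0])
    print(table)
    print("accuracy, high vs low confidence:", acc_high, acc_low)
```

In this idealization, mean confidence rises with signal strength on correct trials and falls on error trials, and the high-confidence half of the trials is more accurate than the low-confidence half.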

To explore when within the trial pupil size could be used to infer decision confidence, we systematically repeated the statistical analyses in Figures 3*c* and 5 when varying the time within the trial and the duration over which pupil size was averaged (Fig. 6). The results show that pupil size toward the end of the trial over a range of analysis windows could be used to infer decision confidence.

Because we used a white fixation marker, our results with pupil size measurements might in principle have been affected by the animals' fixation precision. To control for this potential confound, we therefore performed a number of control sessions in which, instead of a white fixation dot, we used a black fixation marker. If our results were mostly driven by differences in luminance resulting from differences in fixation precision across conditions, the modulation with our experimental covariates should reverse. However, our results were robust when, instead of a white fixation marker, we used a black fixation marker (Fig. 3*f–h*). Together, these analyses support our conclusion that mean pupil size at the end of the stimulus presentation can be used to infer the animals' decision confidence.

### The animals' data separated by inferred decision confidence support the predictions of the integration-to-bound model

Having established the relationship between pupil size and decision confidence in our task, we now use it to test the confidence-related predictions of the integration-to-bound model using our data. To do so, we computed the animals' psychophysical kernels separately after categorizing high- or low-inferred decision confidence trials (median split based on the pupil size metric). For inferred high-confidence trials, we observed a decrease in PKA for both monkeys (Fig. 7*a–d*). In contrast, for inferred low-confidence trials, the PKA either stayed relatively constant throughout the trial (Monkey B, Fig. 7*a–d*, bottom row), or first increased and then decreased (Monkey A, Fig. 7*a–d*, top row). Furthermore, the PKA at the end of low-confidence trials was approximately equal (Monkey B) or higher (Monkey A) than the PKA for high-confidence trials. We then fit the two variants of the integration-to-bound model (compare Fig. 1) while allowing for noise in our assignment of trials as high versus low confidence (to account for the imperfect relationship between pupil size and decision confidence) before computing the models' PKAs for high- and low-confidence trials (Fig. 7*a*,*b*). Importantly, the data for both monkeys best agree with the predictions of an integration-to-bound model when subjective confidence is based on both evidence and time (Kiani et al., 2014) (Fig. 7*b*,*e*,*f*) with the difference between the 2 animals explainable by slightly differing internal integration bounds (compare Fig. 1*i* and Fig. 1*j*), as well as different levels of noise to infer decision confidence (Table 1). Interestingly, the noise to infer decision confidence is lower for Animal A, which is plausible given this animal's more extensive learning of the task (compare Fig. 4).
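One simple way to express an imperfect assignment of trials to the two confidence categories is to corrupt the model's confidence with noise before the median split. The sketch below assumes additive Gaussian noise with a hypothetical standard deviation `noise_sd`; the parameterization used in the actual fits may differ.

```python
import numpy as np

def noisy_median_split(confidence, noise_sd, rng):
    """Assign trials to the high-confidence group from a noisy confidence readout."""
    noisy = confidence + noise_sd * rng.standard_normal(confidence.size)
    return noisy > np.median(noisy)

def label_agreement(noise_sd, n_trials=20000, seed=0):
    """Fraction of trials whose noisy label matches the noise-free label."""
    rng = np.random.default_rng(seed)
    confidence = np.abs(rng.standard_normal(n_trials))  # stand-in for model confidence
    true_high = confidence > np.median(confidence)
    inferred_high = noisy_median_split(confidence, noise_sd, rng)
    return np.mean(true_high == inferred_high)

if __name__ == "__main__":
    for sd in (0.0, 0.5, 5.0):
        print(sd, label_agreement(sd))
```

As `noise_sd` grows, the inferred labels approach a random assignment, which pulls the kernels of the two groups toward each other.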

We next wondered whether the data were also explainable by two alternative accounts of the early psychophysical weighting: (1) models with attractor dynamics resulting from recurrent feedback; or (2) a purely feedforward account that includes adaptation.

To test the first alternative account, we implemented a model (Haefner et al., 2016) in which the decrease of the PKA results from self-reinforcing feedback from decision neurons to sensory neurons. Because of its recurrent connectivity, this model exhibits attractor dynamics, in which early evidence is effectively weighted more strongly than evidence presented late in the trial. Other recurrent models of perceptual decision making, whether across cortical hierarchies (Wimmer et al., 2015), or proposing attractor dynamics within the decision area itself (Wang, 2002; Wong et al., 2007), share this attractor behavior. In these models, the behavior of the decision variable after stimulus onset can be described by a double-well energy landscape, where the minimum of each well corresponds to a choice attractor (compare Wimmer et al., 2015, their Fig. 2*d*, inset). As a result, the effect of early evidence on the decision variable will be amplified by the subsequent pull exerted by whatever attractor toward which the early evidence had pushed the decision variable. While this behavior resembles that of the integration-to-bound model, it differs in its predictions when separating trials according to confidence (Fig. 7*c*). Analogous to our fits of the integration-to-bound models, we included a noise parameter to allow for an imperfect assignment of trials to the high- or low-confidence group when fitting this model to the monkeys' data. These fits were worse than those for the integration-to-bound models (Fig. 7*c*,*e*,*f*). Specifically, we were unable to identify model parameters for which the kernel amplitude in low-confidence trials exceeded that for high-confidence trials at the end of the stimulus presentation (Fig. 8*a*). 
To convince ourselves that an attractor dynamic by itself is indeed unable to account for our data, we confirmed this finding for two idealized attractor models in which attractor strength and hence slope of the PKA were determined by a single parameter (similar to the integration-to-bound model; Fig. 8*b*,*c*). As for the neural sampling-based probabilistic inference model, varying this parameter did not yield kernels for which the kernel amplitude in low-confidence trials exceeded that for high-confidence trials at the end of the stimulus presentation. Indeed, in the absence of confidence noise, the only way to achieve a similar late-trial PKA for high and low confidence was to strengthen the attractor dynamics in one of the models to a degree that made the late-trial PKA close to zero, in contradiction to the data (Fig. 8*c*).
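The early weighting produced by attractor dynamics can be illustrated with a toy double-well system. This sketch is not one of the models fit here: it assumes a one-dimensional decision variable that descends the gradient of the energy E(x) = strength · (x⁴/4 − x²/2) while being driven by one Gaussian stimulus sample per time bin, with confidence taken as the final distance from zero. The function names, `strength`, and `input_gain` are hypothetical.

```python
import numpy as np

def simulate_double_well(n_trials, n_steps, strength, rng, input_gain=0.2):
    """Decision variable in a double-well landscape driven by zero-signal input."""
    stim = rng.standard_normal((n_trials, n_steps))
    x = np.zeros(n_trials)
    for t in range(n_steps):
        # -dE/dx = strength * (x - x**3): repelled from 0, attracted to +/-1.
        x = x + strength * (x - x**3) + input_gain * stim[:, t]
    choice = np.where(x >= 0, 1, -1)
    confidence = np.abs(x)
    return stim, choice, confidence

def pka(stim, choice):
    """Kernel as the difference between choice-triggered stimulus averages."""
    return stim[choice == 1].mean(axis=0) - stim[choice == -1].mean(axis=0)

def attractor_kernels(strength, n_trials=40000, n_steps=20, seed=0):
    rng = np.random.default_rng(seed)
    stim, choice, conf = simulate_double_well(n_trials, n_steps, strength, rng)
    high = conf > np.median(conf)
    return (pka(stim, choice),
            pka(stim[high], choice[high]),
            pka(stim[~high], choice[~high]))

if __name__ == "__main__":
    pka_all, pka_high, pka_low = attractor_kernels(strength=0.1)
    print("all trials, first/last bin:", pka_all[0], pka_all[-1])
    print("high vs low, last bin:     ", pka_high[-1], pka_low[-1])
```

With `strength=0` the sketch reduces to perfect integration and a flat kernel; increasing `strength` steepens the decline of the kernel across all trials, and the kernels for the two confidence groups can then be inspected in the same way as for the bounded model.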

Finally, we tested the behavior of two versions of an early sensory weighting model after Yates et al. (2017, their Figs. 4*a*, 6*a*), in which the decrease in PKA results from adaptation of the sensory responses in a purely feedforward way. The model generates choices based on the integrated inputs of stimulus-selective sensory neurons, whose response decreases over the time of the stimulus presentation. Such a decrease in response amplitude after response onset is typically observed for sensory neurons and may reflect a gain control mechanism or stimulus-dependent adaptation. As expected, we found a decreasing PKA across all trials. But, as for the attractor-based models investigated above and unlike our data, the amplitude of the high-confidence PKA was consistently larger than the low-confidence PKA (Fig. 7*d*). As for the previous model fits, we additionally included a noise parameter to allow for an imperfect assignment of trials into the high- and low-confidence groups. This pattern remained unchanged over a wide range of model parameters that yielded plausible sensory responses (compare Fig. 8*d*).
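A short calculation gives some intuition for why a feedforward weighting account keeps the high-confidence kernel above the low-confidence kernel. It is an idealization rather than the model tested here: it assumes independent unit-variance Gaussian stimulus samples $s_t$ on zero-signal trials, a noise-free linear readout $D = \sum_t w_t s_t$ with time-varying gains $w_t$ (standing in for adaptation), a choice given by the sign of $D$, and a confidence category $C$ that depends only on $|D|$. Because $s_t$ and $D$ are jointly Gaussian,

$$
\mathbb{E}[s_t \mid D] = \frac{w_t}{\sum_u w_u^2}\, D ,
$$

and, by the symmetry of $D$ around zero, the kernel within a confidence category is

$$
\mathrm{PKA}_C(t) = \mathbb{E}[s_t \mid D > 0, C] - \mathbb{E}[s_t \mid D < 0, C] = \frac{2\, w_t}{\sum_u w_u^2}\, \mathbb{E}\big[\,|D| \,\big|\, C\,\big] .
$$

The ratio of the two kernels is therefore the same at every time point,

$$
\frac{\mathrm{PKA}_{\mathrm{high}}(t)}{\mathrm{PKA}_{\mathrm{low}}(t)} = \frac{\mathbb{E}[\,|D| \mid \mathrm{high}\,]}{\mathbb{E}[\,|D| \mid \mathrm{low}\,]} > 1 ,
$$

so under these assumptions both kernels inherit the shape of $w_t$ and cannot cross, whatever the time course of the gains.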

Together, these results indicate that, while each of these models could account for early psychophysical weighting, a decision bound was necessary to account for the monkeys' behavioral differences with inferred decision confidence.

Given the importance of a decision bound to account for the data, we explored the cost of a decision bound on performance in our task (Fig. 9). The psychophysical performance is shown for models with different decision bounds and, as expected, is best in the absence of a decision bound (Fig. 9*a*, red curve). The cost on performance (percent correct) resulting from the decision bound is depicted in Figure 9*b*. The vertical red bar marks the range of the animals' decision bounds obtained from the model fitting (compare Fig. 7*a*,*b*). The performance over this range has reached asymptotic values (exceeding 95% of maximal performance), suggesting that the cost on performance for the animals is small (see Discussion).
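The cost of a decision bound can be sketched with the same kind of toy integrator. The code below is an illustration under assumed parameters, not the simulation behind Figure 9: it uses a single hypothetical signal strength (`drift`), one Gaussian evidence sample per time bin, and treats `bound=np.inf` as the unbounded model.

```python
import numpy as np

def percent_correct(bound, drift=0.1, n_trials=20000, n_steps=20, seed=0):
    """Percent correct of a bounded integrator when the true category is positive."""
    rng = np.random.default_rng(seed)
    evidence = drift + rng.standard_normal((n_trials, n_steps))
    cum = np.cumsum(evidence, axis=1)
    hit = np.abs(cum) >= bound
    # Stop at the first bound crossing, or use all evidence if never reached.
    stop = np.where(hit.any(axis=1), hit.argmax(axis=1), n_steps - 1)
    dv = cum[np.arange(n_trials), stop]
    return 100.0 * np.mean(dv > 0)

if __name__ == "__main__":
    for b in (1.0, 2.0, 4.0, 8.0, np.inf):
        print(b, percent_correct(b))
```

Because the same seed is used for every bound, the comparison across bounds is made on identical evidence streams, so differences in percent correct reflect the bound alone.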

## Discussion

The frequently observed (Kiani et al., 2008; Neri and Levi, 2008; Nienborg and Cumming, 2009; Yates et al., 2017) early weighting of sensory evidence in perceptual decision making tasks has classically been explained as reflecting an integration-to-bound decision process (Mazurek et al., 2003; Kiani et al., 2008). Here, we first derived decision confidence-specific predictions for this account. Second, to test these predictions, we devised a metric based on pupil size that allowed us to estimate 2 macaques' subjective decision confidence on individual trials without the use of a wagering paradigm. Finally, we compared our confidence-specific data with two alternative accounts of early weighting (attractor dynamics and response adaptation) and found that neither of those models could explain our data. This combined approach provided new insights into the animals' decision formation process. It revealed that the frequently observed (Kiani et al., 2008; Neri and Levi, 2008; Nienborg and Cumming, 2009; Yates et al., 2017) early weighting of the sensory evidence was largely restricted to high-confidence trials, approximately consistent with findings in humans (Zylberberg et al., 2012), and that the shape of the PKA confirmed our predictions based on the integration-to-bound model. Indeed, the match between data and model was best when we incorporated a recent proposal that subjective confidence is based not just on the strength of the presented evidence, but also on integration time (Kiani et al., 2014). Moreover, our data could not be fully explained by other computational accounts for early psychophysical weighting, such as sensory adaptation (Yates et al., 2017) or models of perceptual decision making with recurrent processing (Wong et al., 2007; Wimmer et al., 2015; Haefner et al., 2016). We note that our findings do not preclude the contribution of these alternative models. 
However, our results highlight that none of these accounts is sufficient to explain the data by itself and that a decision rule that implements an early stopping of the evidence integration process appears necessary.

Our analysis of pupil size showed that, even without the stabilizing effect of long ITIs, pupil size was reliably correlated with experimental covariates and could be used to infer the animal's decision confidence. The correlation of pupil size with decision confidence is similar to that in a recent psychophysical study in humans (Krishnamurthy et al., 2017) that queried decision confidence directly. As we did here, this study found a positive correlation between subjects' pupil size before they made their judgment and their reported decision confidence. Previous work inferring an animal's decision confidence typically relied on behavioral measurements, such as postdecision wagering (Kiani and Shadlen, 2009; Komura et al., 2013) and the time an animal is willing to wait for a reward (Lak et al., 2014), which increases the complexity of the behavioral paradigm and hence the required training of the animals. To our knowledge, the present study is the first to relate pupil size measurements in animals to decision confidence. Such a pupil size-based metric opens up studies of decision making in animals to include decision confidence without increasing the complexity of the behavioral paradigm.

In our task, the animals were rewarded on each trial directly after making their choice. Consistent with modulation of pupil-linked arousal due to reward expectation (Baruni et al., 2015; Varazzani et al., 2015), pupil size was progressively larger toward the end of the trial when the (known) available reward was large compared with when it was small (compare Fig. 3*b*). Such a reward-based interpretation of the pupil size modulation associated with decision confidence may explain our findings here and those of Krishnamurthy et al. (2017), which contrast with studies associating increases in pupil size with uncertainty (e.g., Satterthwaite et al., 2007; Nassar et al., 2012; Lempert et al., 2015; de Berker et al., 2016; Urai et al., 2017). Specifically, a recent study (Urai et al., 2017) observed the opposite relationship between inferred decision confidence and pupil size, measured after the subject's perceptual report: larger pupil size after the subject's report, and before receiving feedback, was associated with higher decision uncertainty. Access to information (e.g., whether or not a choice is correct) can be rewarding by itself (Behrens et al., 2007; Bromberg-Martin and Hikosaka, 2009). It may therefore be that, in Urai et al. (2017), the reward was such access to information (i.e., the feedback on each trial). When the confidence about the correct choice is low, the information is more valuable, hence resulting in the observed negative correlation with pupil size. Alternatively, this discrepancy may also reflect methodological differences, such as the time interval used for the analysis (before or after the choice was made) (but see also Lempert et al., 2015). More generally, these findings underscore the importance of considering a subject's motivational context when interpreting pupil size modulation.

Moreover, pupil size modulation by cognitive factors has been linked to a number of neural circuits, mirroring the complexity of the signal. These include the locus coeruleus noradrenergic system (Aston-Jones and Cohen, 2005; Joshi et al., 2016), a brain-wide neuromodulatory system involved in arousal, and the inferior and superior colliculi, which mediate a subject's orienting response to salient stimuli (Wang et al., 2012; Wang and Munoz, 2015). The dopaminergic system has also been implicated (Lak et al., 2017; Colizoli et al., 2018), and there is evidence for an association with cholinergic modulation (Polack et al., 2013; Reimer et al., 2016), which is also linked to attention.

The emergence of a reliable signature of decision confidence required that the animals performed the task well (compare Fig. 4). We propose two possible, not mutually exclusive, accounts for this. First, in line with the notion that the observed pupil size modulation linked to decision confidence is driven in part by reward expectation, it may reflect the animal's improved knowledge of the timing of the task and hence the anticipation of the reward. Second, it may reflect the fact that to engage the pupil-linked arousal circuitry a certain threshold of decision confidence needs to be exceeded. Such an interpretation would mean that, once the signature of decision confidence emerges, a higher level of decision confidence is reached at least on some trials.

Our animals' psychophysical behavior separated by inferred decision confidence was well described by a bounded accumulation decision process. These results imply that in a subset of trials sensory evidence was ignored after a certain level of decision confidence had been gained. We find that, in our task, across all difficulty levels, the loss in performance is small for the bounds required to explain our data (Fig. 9). Because the overall loss will differ between different experiments, it might explain some of the differences seen in the temporal profile of PKAs across studies (e.g., Kiani et al., 2008; Neri and Levi, 2008; Nienborg and Cumming, 2009; Wyart et al., 2012; Brunton et al., 2013; Drugowitsch et al., 2016; Yates et al., 2017).

## Footnotes

This work was supported by a European Research Council Starting Independent Researcher Grant (NEUROOPTOGEN) to H.N., by a grant from the Deutsche Forschungsgemeinschaft (within the CRC 1233 Robust Vision, University of Tuebingen) to H.N., funds of the Deutsche Forschungsgemeinschaft to the Centre for Integrative Neuroscience (Grant EXC 307), and by an NEI grant EY028811 to R.M.H.

The authors declare no competing financial interests.

- Correspondence should be addressed to Dr. Hendrikje Nienborg, University of Tuebingen, Werner Reichardt Centre for Integrative Neuroscience, 72076 Tuebingen, Germany. hnienb{at}gmail.com