Abstract
Adaptation is sometimes viewed as a process in which the nervous system learns to predict and cancel effects of a novel environment, returning movements to near baseline (unperturbed) conditions. An alternate view is that cancellation is not the goal of adaptation. Rather, the goal is to maximize performance in that environment. If performance criteria are well defined, theory allows one to predict the reoptimized trajectory. For example, if velocity-dependent forces perturb the hand perpendicular to the direction of a reaching movement, the best reach plan is not a straight line but a curved path that appears to overcompensate for the forces. If this environment is stochastic (changing from trial to trial), the reoptimized plan should take into account this uncertainty, removing the overcompensation. If the stochastic environment is zero-mean, peak velocities should increase to allow for more time to approach the target. Finally, if one is reaching through a via-point, the optimum plan in a zero-mean deterministic environment is a smooth movement but in a zero-mean stochastic environment is a segmented movement. We observed all of these tendencies in how people adapt to novel environments. Therefore, motor control in a novel environment is not a process of perturbation cancellation. Rather, the process resembles reoptimization: through practice in the novel environment, we learn internal models that predict sensory consequences of motor commands. Through reward-based optimization, we use the internal model to search for a better movement plan to minimize implicit motor costs and maximize rewards.
Introduction
Studies of motor adaptation often rely on externally imposed perturbations to induce errors in behavior. For example, in a reaching task, a robotic arm may be used to introduce perturbations. On each trial, “error” is measured as the difference between the observed trajectory on that trial and some average behavior in a baseline condition (before perturbations were imposed). Implicit in many works is the idea that adaptation proceeds by reducing such errors. However, this approach makes the fundamental assumption that the baseline movements are somehow the optimal movements in all conditions. This assumption is demonstrably false. For example, when a spring-like force makes it so that the path of minimum resistance between two points is not a straight line but a curved path, people adapt to the novel dynamics by reaching along that curved path (Uno et al., 1989; Chib et al., 2006). Therefore, at least in these extreme cases, motor error as classically defined does not drive adaptation.
The above example highlights the idea that the purpose of our movements is, at least to a first approximation, to acquire rewarding states (e.g., reach the end point accurately) at a minimum cost. Each environment has its own cost and reward structure. The trajectory or feedback response that was optimum in one environment is unlikely to remain optimum in the new environment (Wang et al., 2001; Burdet et al., 2001; Diedrichsen, 2007; Emken et al., 2007).
Recent advances in optimal control theory (Todorov, 2005) allowed us to revisit the well studied reach adaptation paradigm of force fields and make some theoretical predictions about what the adapted trajectory should look like in each field. In this framework, the problem is to maximize performance. To do so, one of the required steps is to identify (i.e., build a model of) the novel environment so one can accurately predict the sensory consequences of motor commands. A second required step is to use this internal model to find the best movement plan.
As we explored the theory, we found that it made rather interesting predictions in conditions in which the force field was stochastic. Stochastic behavior of an environment introduced uncertainty in the internal model, and the controller that attempted to maximize performance took this uncertainty into account as it generated movement plans. An intuitive example is lifting a cup of hot coffee in which a lid obscures the amount of liquid. If one cannot see the amount of liquid, one is uncertain about its mass. Lifting and drinking from this cup will tend to be slower, particularly as it reaches our mouth.
In the force-field task, one can introduce uncertainty by making the environment stochastic. To maximize performance (reach to the target in time), the theory takes into account this uncertainty and reoptimizes the reach plan. We asked whether adaptation proceeded by returning trajectories toward a baseline or whether adaptation more resembled a process of reoptimization.
Materials and Methods
Our volunteers were healthy right-handed individuals [26.5 ± 5.2 years old (mean ± SD)]. Protocols were approved by the Johns Hopkins School of Medicine Institutional Review Board, and all subjects signed a consent form. Volunteers sat on a chair in front of a robotic arm and held its handle (Hwang et al., 2003), which housed a light-emitting diode (LED). A white screen was positioned immediately above the horizontal plane of the robot/arm, and an overhead projector (refresh rate, 70 Hz; EP739; Optoma, Milpitas, CA) projected images onto it. In experiment 1, the projector displayed a cursor to represent hand position, and the LED was off. In all other experiments, the handle LED was on, and it represented hand position. Movements were made only in a single direction (90°, straight away from the body along a line perpendicular to the frontal plane), along the midline of the subject's body.
Experiment 1.
In this experiment, we repeated the standard force-field adaptation paradigm (Shadmehr and Mussa-Ivaldi, 1994). Subjects (n = 28) trained for 3 consecutive days, practicing reaching in a single direction. They were provided with a target at 9 cm (target was a 5 × 5 mm square) and were rewarded (via a target explosion) if they completed their movement in 450 ± 50 ms. Timing feedback was provided in the form of a blue-colored target if the movement was slower than required. After completion of the movement, the robot pulled the hand back to the starting position.
On day 1, the experiment started with a familiarization block of 150 trials without any force perturbations (null). This was followed by four blocks of field training. The second and third days each consisted of four blocks of field training (each block was 150 trials). Overall, subjects performed ∼2400 field trials.
Let us label the forces produced by the robot as f = Dẋ, where ẋ is hand velocity in Cartesian coordinates. On each trial, D was drawn from a normal distribution such that f = D̄(1 + δ)ẋ = D̄ẋ + D̄δẋ, where D̄ is the mean of the distribution (see below) and δ is a normally distributed scalar random variable with zero mean and variance σ^{2}. The subjects were divided in four groups. For the first and second groups, the variance of the field was zero (i.e., the field did not vary from trial to trial): for group 1 (n = 7), D̄ = [0, 13; −13, 0] Ns/m [a clockwise (CW) perturbation]; for group 2 (n = 7), D̄ = [0, −13; 13, 0] Ns/m [a counterclockwise (CCW) perturbation]. Group 3 (n = 7) and group 4 (n = 7) practiced in a field with the same mean as groups 1 and 2, respectively, but with σ = 0.3. Groups 3 and 4 experienced an additional block of 50 movements at the end of the third day. The variance of the field was set to zero during this last block. The data from this last block allowed us to compare the behavior of the subjects from groups 1 and 2 with groups 3 and 4 in precisely the same environment.
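As a minimal sketch of the field definition above, the following Python draws one trial's field matrix D = D̄(1 + δ) and evaluates f = Dẋ for a given hand velocity. The function names, the fixed random seed, and the example velocity of 0.3 m/s are illustrative assumptions and are not taken from the experiment software; only the matrix entries and the values of σ come from the text.

```python
import random

# Mean curl-field matrix for the clockwise group, in N*s/m (values from the text).
D_MEAN_CW = ((0.0, 13.0), (-13.0, 0.0))

def sample_field(d_mean, sigma, rng):
    """Draw one trial's field D = D_mean * (1 + delta), with delta ~ N(0, sigma^2)."""
    delta = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
    return tuple(tuple(v * (1.0 + delta) for v in row) for row in d_mean)

def field_force(d, velocity):
    """Force f = D * xdot for a 2D hand velocity (vx, vy) in m/s; result in N."""
    vx, vy = velocity
    return (d[0][0] * vx + d[0][1] * vy, d[1][0] * vx + d[1][1] * vy)

rng = random.Random(0)  # fixed seed: an arbitrary choice for reproducibility

# Deterministic field (groups 1 and 2 correspond to sigma = 0).
d_det = sample_field(D_MEAN_CW, 0.0, rng)
f_det = field_force(d_det, (0.0, 0.3))  # hand moving straight ahead (hypothetical speed)

# Stochastic field (groups 3 and 4 correspond to sigma = 0.3): D is redrawn per trial.
velocity = (0.05, 0.3)
trial_forces = [field_force(sample_field(D_MEAN_CW, 0.3, rng), velocity) for _ in range(5)]
```

Because every sampled D is a scaled copy of the same curl matrix, the force stays perpendicular to the hand velocity on every trial; only its magnitude (and, for δ < −1, its sign) changes.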
Experiment 2.
In this experiment, we simply presented a field that had a zero mean but a nonzero variance. The distance between the starting position and the target was 18 cm. The target was projected as a box, the size of which was 5 × 5 mm. The movement direction was the same as that in experiment 1. Subjects (n = 18) were instructed to complete their reach within 600 ± 50 ms. A score indicating the number of successful trials was displayed on the screen. To encourage performance, we paid the subjects based on their score.
This experiment was composed of eight blocks of trials (each block was 150 trials) on a single day. The first block was a familiarization null block with no force perturbations. This was followed by seven field training blocks. The force field was defined as f = Dδẋ, where D = [0, 13; −13, 0] Ns/m and δ is a normally distributed random variable with zero mean and variance σ^{2}. The subjects were divided in two groups. In the small variance group, for the first four blocks of field training, σ = 0.3. In the large variance group, for the first four blocks of training, σ = 0.6. For the remaining three blocks, σ = 0.
Experiment 3.
To further test how variance in the environment affected movement planning, we considered a reaching task through a via-point. The via-point target and the final target were positioned at a distance of 9 and 18 cm from the starting position, respectively. Each target was 5 × 5 mm. The starting, via-point, and final targets were positioned on a straight line, requiring a movement at 90°. The subjects (n = 22) were instructed to reach by passing through the via-point target at t = 400 ± 50 ms and then stop at the final target no later than t = 1.0 s. If both timing constraints were met, then both targets exploded at the completion of the movement. Otherwise, at the completion of the movement, the subject was provided timing feedback for the via-point with arrows and for the final target by color. No timing feedback was provided during the movement. A score indicating the number of explosions was constantly displayed on the screen. To encourage performance, we paid the subjects based on their score.
This experiment was composed of eight blocks of trials (each block was 150 trials) on a single day. The first block was a familiarization null block with no force perturbations. This was followed by seven field training blocks. The force field was defined as f = Dδẋ, where D = [0, 13; −13, 0] Ns/m and δ is a normally distributed random variable with zero mean and variance σ^{2}. The subjects were divided in two groups. In the small variance group, for the first four blocks of field training, σ = 0.3. In the large variance group, for the first four blocks of field training, σ = 0.6. For the remaining three blocks, σ = 0.
Movement analysis.
Movement initiation was defined as the time when the hand velocity crossed the threshold of 3 cm/s. Hand paths were aligned at the starting position. In the first experiment, we computed overcompensation by forming a difference trajectory between the null and field conditions: for each subject, we computed the average hand path over the last 50 trials of field training and then subtracted the x-position (axis perpendicular to the direction of the target) of this trajectory from the x-position of the subject's own average hand path in the last 50 trials of the null field. To compute how changes in field variance affected hand speed in experiments 2 and 3, we computed an average speed profile from the mean hand path of the last 50 trials of each block. The speed profiles were normalized with the peak speed in the first familiarization block to remove the effect of personal maximum speed biases on the average speed profile.
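The onset criterion and the speed normalization described above can be sketched as follows. This is an illustrative reconstruction rather than the analysis code used in the study: the function names, the 10 ms sampling interval, the sample speed profile, and the baseline peak of 0.25 m/s are all hypothetical.

```python
ONSET_THRESHOLD = 0.03  # 3 cm/s expressed in m/s (threshold from the text)

def movement_onset(speeds, dt, threshold=ONSET_THRESHOLD):
    """Time (s) of the first sample whose speed exceeds the threshold, or None."""
    for i, s in enumerate(speeds):
        if s > threshold:
            return i * dt
    return None

def normalize_profile(speeds, baseline_peak):
    """Scale a speed profile by the subject's peak speed in the familiarization block."""
    return [s / baseline_peak for s in speeds]

def peak_speed(speeds, dt):
    """Return (peak value, time of peak in s) for a sampled speed profile."""
    i = max(range(len(speeds)), key=lambda k: speeds[k])
    return speeds[i], i * dt

# Hypothetical mean speed profile (m/s) sampled every 10 ms.
speeds = [0.0, 0.01, 0.02, 0.05, 0.20, 0.35, 0.30, 0.10, 0.02]
onset = movement_onset(speeds, dt=0.01)
normalized = normalize_profile(speeds, baseline_peak=0.25)
peak_value, peak_time = peak_speed(normalized, dt=0.01)
```

Normalizing by each subject's own baseline peak means that a value above 1 indicates a faster peak than in the familiarization block, independent of how fast that subject habitually moves.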
Modeling and simulations.
We used stochastic optimal feedback control (OFC) to model reaching (Todorov and Jordan, 2002; Todorov, 2005). In this framework, the trajectory of a reach is determined by three components: an optimal controller that generates motor commands, an internal model that predicts the sensory consequences of those commands, and a motor plant/environment that reacts to those sensory consequences. Noise is signal dependent with an SD that grows with the size of the motor commands (Harris and Wolpert, 1998; Jones et al., 2002; van Beers et al., 2004). Our theoretical work here is novel only in the sense that it considers the problem of optimal control in the context of uncertainty about the internal model. To tackle this problem, we extended the approach introduced by Todorov (2005). We found that the problem of model uncertainty was a dual to the problem of control with signal-dependent noise. The mathematics that we used to solve the uncertainty problem is very similar to that used by Todorov (2005) to solve the signal-dependent noise problem. Here, we only outline the procedures and leave the derivations for the supplemental material (available at www.jneurosci.org).
Consider a linear dynamical system: x_{t+1} = Ax_{t} + Bu_{t}. Here, x_{t} is the state of the system at time t, u_{t} is the control signal input to the system, and the matrices A and B describe the dynamics of the system. Previous studies have used deterministic model parameters A and B, leaving no room for representation of the learner's uncertainty about these parameters. Here, we represent the model parameter A as a stochastic variable, leading to the following equation: x_{t+1} = (A + V)x_{t} + Bu_{t}, where V is a Gaussian random variable with mean zero and variance Q_{v}. One can see that the uncertainty in parameter A is state-dependent noise. We derived the optimal feedback controller and optimal state estimator for this system with model noise. In that formulation, y_{t} is the observation made by the system, H is the observation matrix, and ξ_{t}, ε_{t}^{i}, and ω_{t} are Gaussian random variables with mean 0 and variance 1 representing the additive and multiplicative state variability and the measurement noise, respectively. C_{i} are the scaling matrices for the state- and control-dependent noise for each noise source ε_{t}^{i}. Q_{t} is the weight matrix of state cost, and R is the weight matrix of motor cost. x_{t} is the actual state of the system, which is not available to the controller. The controller only has an estimate of the state x̂_{t} available through the state estimation process. For analytical tractability, the state estimate is assumed to be updated according to a linear recursive filter: x̂_{t+1} = Ax̂_{t} + Bu_{t} + K_{t}(y_{t} − Hx̂_{t}), where K_{t} is the Kalman gain. The optimal control policy is of the following form: u_{t} = −L_{t}x̂_{t}, where L_{t} is the time-varying feedback-gain matrix that determines the controller's response.
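To make the estimator update concrete, here is a minimal Python sketch of one time step of the linear recursive filter x̂_{t+1} = Ax̂_{t} + Bu_{t} + K_{t}(y_{t} − Hx̂_{t}). It assumes the usual linear feedback law on the state estimate, u_{t} = −L_{t}x̂_{t}, as in Todorov (2005). The matrices and the gains K and L below are arbitrary illustrative numbers for a one-dimensional position/velocity state. They are not the optimal gains derived in the supplemental material, and the helper names are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def control(L, x_hat):
    """Linear feedback on the state estimate: u_t = -L_t x_hat_t (assumed form)."""
    return [-c for c in matvec(L, x_hat)]

def estimator_step(A, B, H, K, x_hat, u, y):
    """One filter update: x_hat_{t+1} = A x_hat + B u + K (y - H x_hat)."""
    predicted = [a + b for a, b in zip(matvec(A, x_hat), matvec(B, u))]
    innovation = [yi - hi for yi, hi in zip(y, matvec(H, x_hat))]
    return [p + c for p, c in zip(predicted, matvec(K, innovation))]

# Illustrative 1D system with state [position, velocity] and a 10 ms time step.
A = [[1.0, 0.01], [0.0, 1.0]]
B = [[0.0], [0.01]]
H = [[1.0, 0.0]]      # only position is observed
K = [[0.5], [0.0]]    # placeholder estimator gain, not a computed Kalman gain
L = [[2.0, 1.0]]      # placeholder feedback gain, not an optimized one

x_hat = [0.1, 0.0]    # current state estimate
y = [0.12]            # current noisy position observation
u = control(L, x_hat)
x_hat_next = estimator_step(A, B, H, K, x_hat, u, y)
```

In the full model, K_{t} and L_{t} are computed jointly and change over the course of the movement; the sketch shows only how a given pair of gains is applied at a single step.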
Optimal control provides closed-form solutions only for linear dynamical systems. We therefore modeled the arm for the single direction of movement as a point mass in Cartesian coordinates. The components of state were x(t) = [p_{x}(t), ṗ_{x}(t), p_{y}(t), ṗ_{y}(t), f_{x}(t), f_{y}(t), T_{x}, T_{y}], where p is hand position, f is force, and T is target position. In the cost function, MT is the desired movement time and MT_{H} is the time interval after movement completion for which the controller is supposed to hold position at the target.
Whereas in the mathematics we could solve the problem only for the case in which variance was a measure of within-trial noise in the parameter D, in our experiments, because of safety concerns, we held the noise constant during a trial and changed it only from trial to trial.
The force-field parameters used in the simulations were the same as those used for the experiments. We found that even with extensive training, subjects learn only ∼80% of the field. For example, in channel trials in which we and others have measured the forces that subjects produce, the force trajectory is at most 80–82% of the imposed field (Scheidt et al., 2000; Hwang et al., 2006; Smith et al., 2006). To account for this, in the simulations, the environment produced forces that were identical to those produced by our robot, but adaptation of the subject was modeled as an internal model that predicted a fraction of these forces: D̂ = αD̄ + γD̄, where α is the fraction and γ is a normally distributed random variable with 0 mean and variance σ^{2}. For simulations of the via-point task, we set D̂ = γD̄ (because the mean of the field was zero), and the state was extended to hold the via-point positions TV_{x} and TV_{y}. In the cost function for the via-point task simulations, MT_{v} is the “via-point time.”
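The internal-model estimate D̂ = αD̄ + γD̄ can be sketched in a few lines of Python. The function name, the seed, and the choice of α = 0.8 with σ = 0.3 in the example are illustrative assumptions (α = 0.8 reflects the ∼80% figure quoted above); this is not the simulation code used for the figures.

```python
import random

# Mean clockwise curl field in N*s/m (values from the text).
D_MEAN = [[0.0, 13.0], [-13.0, 0.0]]

def internal_model_estimate(d_mean, alpha, sigma, rng):
    """Sample D_hat = alpha * D_mean + gamma * D_mean, with gamma ~ N(0, sigma^2)."""
    gamma = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
    return [[(alpha + gamma) * v for v in row] for row in d_mean]

rng = random.Random(1)  # fixed seed: an arbitrary choice for reproducibility

# Deterministic field: the learner's model is a fixed fraction of the true field.
d_hat_det = internal_model_estimate(D_MEAN, alpha=0.8, sigma=0.0, rng=rng)

# Stochastic field: the estimate carries trial-to-trial uncertainty.
d_hat_samples = [internal_model_estimate(D_MEAN, 0.8, 0.3, rng) for _ in range(3)]

# Via-point task: the field mean is zero, so only the gamma term remains (alpha = 0).
d_hat_via = internal_model_estimate(D_MEAN, alpha=0.0, sigma=0.3, rng=rng)
```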
Results
Model predictions: reaching in a deterministic or stochastic field
In the null field, the optimal control policy to move a mass to a target in a given amount of time is a straight-line trajectory (Fig. 1a, dashed-line trajectory) with a bell-shaped velocity profile. However, if the mass is moving in a velocity-dependent curl field that pushes it perpendicular to its direction of movement, then the best policy is a slightly curved movement (Fig. 1a, trajectory marked by “l”) that overcompensates for the initial curl forces. That is, if the field pushes the hand to the right, the optimum policy is a trajectory that is to the left of baseline.
To see the rationale for this, we plotted the forces produced by the optimal controller and compared them with the forces that must be produced if the mass is to move along a straight-line, minimum-jerk trajectory (Fig. 1b). The optimal controller produced less total force (Fig. 1b, ∫f^{T}fdt) because by overcompensating early into the movement, when speeds were small, it could rely on the environmental forces to bring the mass back toward the target. Therefore, the curved, apparently overcompensating trajectory actually produced smaller total forces than a straight trajectory.
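As a rough illustration of the straight-line reference case, the sketch below numerically integrates the squared lateral force needed to cancel the curl field along a straight minimum-jerk reach, using the standard minimum-jerk speed profile v(t) = (d/T)(30τ² − 60τ³ + 30τ⁴) with τ = t/T. It covers only the field-cancelling component of ∫f^{T}fdt; the inertial force along the movement (which would require an assumed mass) and the optimal controller's curved solution are not modeled. The reach distance and duration are those of experiment 1, while the function names and the number of integration steps are illustrative choices.

```python
D_GAIN = 13.0     # curl-field gain in N*s/m (from the text)
DISTANCE = 0.09   # reach distance in m (experiment 1)
DURATION = 0.45   # movement time in s (experiment 1)

def min_jerk_speed(t, distance, duration):
    """Speed (m/s) along a straight minimum-jerk reach at time t."""
    tau = t / duration
    return (distance / duration) * (30 * tau**2 - 60 * tau**3 + 30 * tau**4)

def lateral_force_cost(gain, distance, duration, steps=1000):
    """Trapezoidal estimate of the integral of (gain * speed)^2 over the reach (N^2*s)."""
    dt = duration / steps
    total = 0.0
    prev = (gain * min_jerk_speed(0.0, distance, duration)) ** 2
    for k in range(1, steps + 1):
        cur = (gain * min_jerk_speed(k * dt, distance, duration)) ** 2
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total

cost = lateral_force_cost(D_GAIN, DISTANCE, DURATION)
peak = min_jerk_speed(DURATION / 2, DISTANCE, DURATION)
```

For this profile the integral of v² has the closed form (10/7)d²/T, so the numerical result can be checked against 169 × (10/7) × 0.09²/0.45 N²·s.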
We arrived at this result by assuming that the learner had formed a perfect model of the force field. If the learner's model predicted <100% of the effects of the field, then the combined effect of the controller and the environment is a more complicated trajectory. For example, suppose that the actual field is represented by f = Dẋ and the learner's estimate is D̂ = αD for 0 ≤ α ≤ 1. When α is small (e.g., 0.2), the field overpowers the controller and pushes the mass in the direction of its perturbation (Fig. 1a). As the controller acquires a more accurate forward model (α becomes closer to 1), the trajectory becomes “S” shaped, displaying an apparent overcompensation. Previous work suggests that with training, subjects acquire a model accuracy of ∼0.8 (Scheidt et al., 2000; Hwang et al., 2006; Smith et al., 2006). That is, in channel trials in which force output is quantified, the peak is ∼80% of the field.
We produced these results using a linear, point-mass model of dynamics. We wondered whether overcompensation was also present for a more realistic nonlinear model of the two-link arm. In this case, closed-form solutions are not possible, but Li and Todorov (2007) have provided tools that aid the search process. We simulated reaching to various directions using a deterministic, nonlinear model of dynamics and again found that the controller produced an overcompensation in all directions (see supplemental material, available at www.jneurosci.org).
The above results were in a condition in which the task dynamics were invariant from trial to trial. How should planning change when one is uncertain of the strength of the force field? That is, what is the best way to perform the task if the field is stochastic? We returned to the linear model of dynamics and derived the closed-form solution for this stochastic optimal control problem (see supplemental material, available at www.jneurosci.org). We found that the overcompensation in the controller was a function of its uncertainty: as the variance of the field increased, the controller produced smaller amounts of overcompensation and had larger errors in the direction of the field (Fig. 1c). Furthermore, the simulations uncovered an interesting prediction: as field variance increased, the optimum plan no longer had a bell-shaped speed profile. Rather, the speed profile became skewed with a peak that was larger (Fig. 1c), slowing the hand as it approached the goal (i.e., the target).
In summary, theory predicted that in a deterministic curl field, the optimal trajectory is a slightly curved hand path that appears to overcompensate for the forces. In a stochastic field, the optimum trajectory loses its overcompensation tendencies, peak velocities become larger, and the timing of the peak shifts earlier in time, allowing the hand to approach the target more slowly.
Experiment 1: deterministic versus stochastic curl fields
In experiment 1, subjects experienced either a CW (groups 1 and 3) or a CCW (groups 2 and 4) velocity-dependent curl field. In groups 1 and 2, the field was constant from trial to trial. In groups 3 and 4, the field had the same mean as in the constant group but had a nonzero variance.
Trajectories in the constant field did not return to the straight paths recorded in the null condition. Rather, they tended to show an overcompensation. Data from two subjects during various stages of adaptation are shown in Figure 2a. The hand paths tended to overcompensate early into the movement and then slightly undercompensate as the hand approached the target, resulting in S-shaped paths. This basic result was reported before (Thoroughman and Shadmehr, 2000). The only novelty here is that we see that this shape was attained on the first day of training and was maintained for the duration of the 3 d experiment.
By the end of training on day 1, the success rates had reached levels observed in the null condition (Fig. 2b), and rates tended to improve with more days of training (repeated-measures ANOVA for the two groups and 3 d; effect of day: F_{(2,2)} = 5.9, p = 0.009; effect of group: F_{(1,12)} = 2.76, p = 0.12; interaction, F_{(2,28)} = 1.67, p = 0.21). We quantified overcompensation as the within-subject maximum perpendicular displacement from their null trajectory (Fig. 2c). Overcompensation would imply a negative measure for the CW group and a positive measure for the CCW group. The timing of this measure was early in the movement: in the CW group, at 149, 139, and 154 ms (SD for each day, ∼40 ms); and in the CCW group, at 176, 180, and 179 ms (SD, ∼22 ms). For the CCW group, this measure was significantly greater than zero for all 3 d (paired t test; df = 6; p < 0.05 for each day). For the CW group, this measure was significantly less than zero for all 3 d (paired t test; df = 6; p < 0.05 for each day). Importantly, when we examined performance of each subject on each day, we found a significant correlation between overcompensation and success rates (CW field: r = 0.51, p < 0.02; CCW field: r = 0.61, p < 0.005). Therefore, both groups had hand trajectories that appeared to overcompensate for the field. As the overcompensation increased, there was a tendency for performance to improve.
Theory had predicted that overcompensation should disappear and velocity peaks should rise as the field acquires stochastic properties. We trained a new set of volunteers (groups 3 and 4) in a field that had the same mean as in groups 1 and 2, respectively, but with a nonzero variance. Figure 3a shows the average hand trajectories in the last 50 trials of each day for subjects who trained in the deterministic (σ_{0}) or stochastic (σ_{L}) CW fields. In the stochastic field, movements gradually lost their overcompensation (Fig. 3b) (overcompensation in the σ_{0} vs σ_{L} group on day 3; p < 0.05; df = 12). There was one additional data point for the last block on the last day of training for the σ_{L} group. During this block, subjects from the σ_{L} group were exposed to a zero variance field. That is, in this block of 50 trials, the environments for the σ_{0} and σ_{L} groups were identical. Despite this, the overcompensation remained significantly smaller for the stochastic group (p < 0.05; df = 12). Because the σ_{L} groups kept the success rate high in the test block, the change of the trajectory is not a result of interference from the force perturbations.
Another prediction of the theory was that hand trajectories in the σ_{L} group should show larger errors in the direction of the force perturbations (Fig. 1c): we quantified the interaction between the early overcompensation and the late undercompensation as the within-subject difference in the signed area enclosed between the hand trajectory in the training session and the hand trajectory in the null session (Fig. 3c, inset, A1–A2). Simulations had predicted that with increased field variance, this parameter should become more positive. We observed this tendency in our subjects: for the parameter A1–A2, the σ_{0} and σ_{L} groups differed significantly on day 3 (p < 0.02; df = 12), as did σ_{0} (day 3) and σ_{L} (day 3: test) (p < 0.02). As overcompensation declined, performance in the σ_{L} group improved (Fig. 3d).
Finally, the model had predicted that the stochastic and deterministic groups would show different speed profiles (Fig. 1c). Figure 3f shows the speed profiles for the last training block on each day of training, and Figure 3e shows the peak speed distribution on each day. The speed profiles were normalized with respect to the individual peak speed in the last 50 trials of the null block on day 1. By day 3, the σ_{L} group displayed a hand speed that was skewed and had a higher peak value (p < 0.05; df = 12). Even in the same environment (day 3 test), the peak speed was significantly larger in σ_{L} versus σ_{0} (p < 0.01; df = 12).
The same tendencies were observed when subjects were trained in a stochastic CCW field (Fig. 4): overcompensation disappeared (Fig. 4a,b), the measure A1–A2 became more negative (Fig. 4c) as performance rates improved (Fig. 4d), and peak speeds increased (Fig. 4e). For example, the average within-subject correlation between overcompensation and success rates was −0.52 (p < 0.01). That is, people responded to the increased field variance by eliminating their overcompensation and producing an increasing peak speed that skewed the profile and slowed the hand as it approached the target.
Experiment 2: deterministic versus stochastic zero-mean fields
The reduced overcompensation in the high-variance field was consistent with the optimal policy, but it may have been attributable to another cause: if the field is more variable, people may learn it less well (Donchin et al., 2003). Although this would not explain the increased speed in the σ_{L} group, it can account for the reduced overcompensation. This motivated us to test the theory further. In experiment 2, the task was to reach to a target at 18 cm. Importantly, the field was now zero-mean with zero, small, or large variance. Because the mean of the field was zero in all blocks, this helped ensure that any differences that we might see regarding control policies should be attributable to the variance of the field and not a bias in learning of the mean.
Volunteers (n = 18) were separated into two groups: a group that experienced a field with small variability σ_{S} and a group that experienced a field with larger variability σ_{L}. They began with a familiarization set (150 trials, no forces), followed by four sets of training in a zero-mean stochastic field, followed by three sets of a zero-mean deterministic field σ_{0}. Figure 5 shows the mean hand paths and speed during the last 50 trials of each condition. The hand paths were essentially straight in the σ_{0} condition. With increased variance, there was a small tendency for hand trajectory to curve to the right of the null, but this tendency did not reach significance (maximum displacement from null, p > 0.1 for both groups; note the difference in scales on the x- and y-axes). However, as the theory had predicted, peak speeds gradually increased when the field became variable (Fig. 5b,c). Furthermore, the timing of the peak speed shifted earlier in the trajectory of the hand: from 243 ms in the baseline (zero-variance) condition to 231 and 209 ms in the σ_{S} and σ_{L} conditions, respectively (within-subject change across all subjects: main effect of condition, p = 0.005; for the σ_{L} group: main effect of condition, p = 0.012). A more detailed view of these changes in peak speeds is shown in Figure 5c. Increased variance produced a gradual increase in peak speed. Return to zero variance resulted in a gradual reduction in peak speed.
In summary, when we considered an environment in which the mean perturbation was zero, people responded to the unpredictability of the environment by gradually increasing their peak speeds and skewing the speed profile to reduce speed as the hand approached the target.
Experiment 3: via-point in a stochastic zero-mean field
The theory explained that increased uncertainty should make one more cautious as the movement approaches the target. In our previous examples, the target was always at the end of the movement. If one has a target in the middle of a movement, then increased uncertainty should make the movement through that target change as uncertainty increases. We tested this idea in experiment 3. Here, the task was to reach to an end point target (at 18 cm) by passing through a via-point (at 9 cm), as illustrated in Figure 6a. The field was zero-mean with either a zero or a nonzero variance. The experiment began with a few blocks of no forces. For another few blocks, a field was introduced that, from trial to trial, had zero mean but nonzero variance. Intuitively, we expected that the optimal trajectory in a low-uncertainty field should be a straight line with a bell-shaped velocity profile. As field uncertainty increased, the movement should slow down as it approaches the via-point, effectively producing a segmented movement.
Figure 6a shows the predictions of the theory for various levels of field variance σ_{0} = 0, σ_{S} = 0.3, and σ_{L} = 0.6. The predicted hand paths remained straight, unaffected by the different levels of variance (Fig. 6a). However, as the field variance increased, the controller separated the two movements into “segments,” showing a dip in velocity as the mass approached the via-point.
To test these predictions, volunteers (n = 22) were divided into two groups. They began with a familiarization set (150 trials, no forces), followed by four sets of training in a zero-mean but variable force field (σ_{L} = 0.6 or σ_{S} = 0.3), followed by three sets of zero-mean, zero-variance force fields (σ_{0} = 0). Figure 6b shows the mean hand paths and speed during the last 50 trials of each condition. The hand paths were essentially straight in the σ_{0} condition with a speed profile that had a single peak. With increased variance, we did not detect a significant difference in the mean of hand paths across subjects (maximum displacement from null, p > 0.1), but the speed showed two peaks, suggesting a segmentation of the reach. Indeed, the speed at the via-point (0.4 s) in the σ_{0} condition was significantly higher than in the σ_{L} and σ_{S} conditions (p < 0.01 in each case).
Discussion
If the purpose of a movement is to acquire a rewarding state at a minimum cost (Todorov and Jordan, 2002), then the idea that the brain computes a desired movement trajectory and that trajectory remains invariant with respect to environmental dynamics is untenable. Rather, a broad implication of OFC theory is that when the environment changes, the learner performs two computations simultaneously: (1) finds a more accurate model of how motor commands produce changes in sensory states; and (2) uses that model to find a better movement plan that reoptimizes performance. Here, we performed a number of experiments to test this idea.
In experiment 1, we revisited the well studied reaching task in which velocity-dependent forces act perpendicular to the direction of motion. Thoroughman and Shadmehr (2000) reported that with adaptation, hand paths curved out beyond the baseline trajectory, suggesting an overcompensation in the forces that subjects produced. They interpreted those results as a characteristic of the basis functions with which the brain might be approximating the force field (Donchin et al., 2003; Wainscott et al., 2005). However, experiments that measured hand forces in channel trials found that the maximum force was, at most, 80% of the field (Scheidt et al., 2000, 2001; Smith et al., 2006). How could one produce hand paths that appeared to overcompensate for the field, yet only produce a fraction of the field forces? Here, OFC solved this puzzle. It explained that both the curved hand path and the undercompensation of the peak force were signatures of minimizing total motor costs of the reach.
We next considered an environment in which the dynamics were stochastic. Intuitively, the idea is that if we do not know the amount of coffee in a cup (or how hot it may be), we lift it and drink from it differently than if we are certain of its contents. For example, Chhabra and Jacobs (2006) found that when motor noise was artificially increased, subjects learned to alter their control policy in stabilizing a tool. Inspired by these results, we slightly extended the mathematics of OFC (see supplemental material, available at www.jneurosci.org) to make predictions about how reach plans should change when the learner is uncertain about the dynamics of the environment. If the force field is stochastic, theory predicted that overcompensation should disappear and peak speeds should increase. We observed both tendencies.
In experiment 2, we tested the theory further by considering an environment that, on average, had no perturbations but that had a variance that could be zero, small, or large. Theory predicted that as field variance increased, the trajectory should shift from a bell-shaped speed profile to one that showed a larger peak earlier in the movement, allowing more time to control the limb near the target. We observed this tendency.
Finally, in experiment 3, we considered a task in which one of the goals was to go through a viapoint that was positioned along a straight line to the final target. As field variance increased, theory predicted that one should slow down as the hand approaches the viapoint. That is, movements should exhibit a single peak in their speed profiles when field variance was zero, but multiple peaks as the field variance increased. Indeed, when faced with a zero variance field, movements had a single peak. With increased variance, subjects segmented their movements into two submovements.
The idea that movement planning should depend on the dynamics of the task seems intuitive. For example, Fitts (1954) observed that changing the weight of a pen affected how people planned their reaching movements: to maintain accuracy, people moved the heavier pen more slowly than the lighter pen. In reach adaptation experiments, however, movement time is often constrained by the experimenter, and so we had thought that a fully adapted trajectory would return to baseline conditions (Shadmehr and Mussa-Ivaldi, 1994). Our results here reject the notion of an invariant desired trajectory. Instead, the results are consistent with a drive to optimize movements in terms of their motor costs and accuracy (Todorov and Jordan, 2002).
In the theory, uncertainty about the magnitude of a velocity-dependent field corresponds to a velocity-dependent noise. Given this noise, one should minimize speed at the task-relevant areas: at the viapoint and at the end point. In the simple reach task, when subjects were uncertain about the field, they reached with skewed speed profiles that had a higher peak and a longer tail, resulting in slower speeds near the target. In the viapoint task, the increased uncertainty resulted in movements that had a segmented appearance, slowing at the viapoint.
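The correspondence between field uncertainty and velocity-dependent noise can be written out in a few lines. The notation below is assumed for illustration and is not taken from the supplemental derivation: $\bar{b}$ is the mean field gain, $\delta$ is its zero-mean random deviation with SD $\sigma$, $J$ is a rotation that makes the force perpendicular to hand velocity $\mathbf{v}$, and $\mathbf{f}$ is the resulting force on the hand.

```latex
\begin{aligned}
\mathbf{f} &= (\bar{b} + \delta)\, J \mathbf{v},
  \qquad \delta \sim \mathcal{N}(0, \sigma^{2}),
  \qquad J = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \\
\mathbb{E}[\mathbf{f}] &= \bar{b}\, J \mathbf{v} \\
\operatorname{Cov}[\mathbf{f}] &= \sigma^{2}\, (J\mathbf{v})(J\mathbf{v})^{\top},
  \qquad \operatorname{tr}\operatorname{Cov}[\mathbf{f}] = \sigma^{2} \lVert \mathbf{v} \rVert^{2}
\end{aligned}
```

Because the unpredictable part of the force has a variance proportional to the squared speed, it vanishes when the hand is at rest and is largest at peak speed. Moving slowly near the viapoint and the end point therefore reduces the variability that the field can inject where accuracy matters.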
Previous models of movement planning successfully explained smoothness of reaching and eye movements using costs such as end point variance (Harris and Wolpert, 1998, 2006) or change in muscle forces or torques (Uno et al., 1989). These two models are closely related. For example, minimum torque change is equal to minimum motor command when the time constant of muscle activation is embedded in the model. The minimum variance is equal to the minimum weighted sum of motor commands when signal-dependent noise is assumed. This notion was supported by the simulations of Thoroughman et al. (2007), in which large overcompensation was predicted by both minimum torque change and minimum variance models in the force-field task. Because these models all minimize the sum of motor commands, the typical OFC problem that has motor costs is closely related to these two models. However, OFC is a more appropriate framework for simulations in motor control because it allows one to consider not only feedback (e.g., noise in sensory measurements) but also the uncertainty associated with internal models that predict the sensory feedback.
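The link between end point variance and a weighted sum of motor commands can be checked numerically. The sketch below assumes a linear system in which the end point is a weighted sum of the commands, and noise whose SD is proportional to the magnitude of each command; under those assumptions the end point variance equals k² Σ aₜ² uₜ², a weighted sum of squared commands. The weights `a`, commands `u`, and noise scale `k` are hypothetical values, not parameters from any of the cited models.

```python
import numpy as np

def endpoint_variance(u, a, k, n_trials=200_000, seed=0):
    """Monte Carlo estimate of end point variance under signal-dependent noise.

    u : planned motor commands, one per time step (hypothetical values)
    a : weight of each command on the final position (hypothetical values)
    k : noise scale; each command is corrupted by noise with SD k * |u_t|
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_trials, u.size))
    noisy_u = u * (1.0 + k * eps)   # signal-dependent noise on every command
    endpoints = noisy_u @ a         # linear map from commands to end point
    return endpoints.var()

u = np.array([2.0, 1.0, -1.0, -2.0])
a = np.array([0.4, 0.3, 0.2, 0.1])
k = 0.2

predicted = k**2 * np.sum(a**2 * u**2)   # weighted sum of squared commands
simulated = endpoint_variance(u, a, k)

print("predicted variance:", predicted)
print("simulated variance:", simulated)
```

Under these assumptions, minimizing end point variance and minimizing a quadratic motor cost with time-varying weights are the same optimization problem, which is one way to see why the two families of models make similar predictions.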
Despite this, our model is a diagram. It represented the hand as a point mass, the costs and rewards as quadratic functions, and uncertainty as state-dependent noise when, in fact, the field was noisy from trial to trial, not within a trial. These are symptoms of our desire to solve mathematical problems analytically. Do the data match the predictions because of some fundamental truth in the model, or because of some unexplained quirk in the unmodeled dynamics? We approached this question in two ways. First, we tried to minimize the influence of unmodeled dynamics by considering environments that, on any given trial, had zero expected value. Second, we tested the same model in separate experiments in which environments had different means and the goals of the tasks were near (experiment 1), far (experiment 2), or both (experiment 3) in time or space. The consistency of the observations is some reassurance that reoptimization is a better model of motor control than perturbation cancellation toward a desired trajectory.
There are significant limitations in our current abilities to apply the theory to biological movements. For example, the theory faces significant hurdles when we consider that there are multiple feedback loops in the biological motor control system, namely the state-dependent response of muscles, spinal reflex pathways, as well as the long-loop pathways. In a more realistic setting, it is unclear what is meant by a motor command and a motor cost. Therefore, the best that we can currently claim is that our experimental results are difficult to explain with the notion of an invariant desired trajectory, but qualitatively in agreement with the theory.
We have implied that with training, people learn a forward model of the task and then use that model to form a better movement plan (Hwang and Shadmehr, 2005). However, it is possible to form optimum policies from the reward prediction errors on each trial without forming an explicit forward model. In our view, such an approach would be inconsistent with the large body of data from experiments in generalization (Conditt et al., 1997).
The cerebellum appears to be the key structure for computing a forward model: cerebellar agenesis produces a striking deficit in the ability to predict and compensate for consequences of one's own motor commands (Nowak et al., 2007), cerebellar damage impairs the ability to adapt reaching (Maschke et al., 2004; Smith and Shadmehr, 2005) and throwing movements (Martin et al., 1996), and reversible disruption of cerebellar output pathways to the cortex produces within-subject impairments in reach adaptation (Chen et al., 2006). Assuming that the cerebellum is crucial for forming a more accurate forward model, how does the brain use this model to reoptimize movements?
Because the search for a better movement plan (or control policy) is a problem that depends on costs and rewards of the task, and dopamine appears to be a crucial neurotransmitter that responds to reward prediction errors (Schultz et al., 1997) and reward uncertainty (Fiorillo et al., 2003), it is possible that the process of finding an optimal control policy depends on the basal ganglia. Recently, Mazzoni et al. (2007) demonstrated that the slowness of movements in Parkinson's disease may be understood in terms of an imbalance in the cost function of an optimal controller. They suggested that in these patients, the motor costs relative to expected rewards had become unusually large. Basal ganglia patients typically show some ability to adapt to force fields (Krebs et al., 2001; Smith and Shadmehr, 2005). However, a strong prediction of the current theory is that they will be impaired in reoptimizing their movements.
In summary, our results support the hypothesis that control of action proceeds via two related pathways: on the one hand, adaptation produces a more accurate estimate of the sensory consequences of the motor commands (i.e., learn an accurate forward model), and on the other hand, our brain searches for a better movement plan so as to minimize an implicit motor cost and maximize rewards (i.e., find an optimum controller).
Footnotes

This work was supported by National Institutes of Health Grant NS37422 and by a grant from the United States Israel Binational Science Foundation. J.I. was supported by a grant from the Japan Society for the Promotion of Science.
Correspondence should be addressed to Dr. Jun Izawa, Department of Biomedical Engineering, Johns Hopkins University, 416 Traylor Building, 720 Rutland Avenue, Baltimore, MD 21205-2195. jizawa@jhu.edu