## Abstract

Storing and recalling spiking sequences is a general problem the brain needs to solve. It is, however, unclear what type of biologically plausible learning rule is suited to learn a wide class of spatiotemporal activity patterns in a robust way. Here we consider a recurrent network of stochastic spiking neurons composed of both visible and hidden neurons. We derive a generic learning rule that is matched to the neural dynamics by minimizing an upper bound on the Kullback–Leibler divergence from the target distribution to the model distribution. The derived learning rule is consistent with spike-timing dependent plasticity in that a presynaptic spike preceding a postsynaptic spike elicits potentiation while otherwise depression emerges. Furthermore, the learning rule for synapses that target visible neurons can be matched to the recently proposed voltage-triplet rule. The learning rule for synapses that target hidden neurons is modulated by a global factor, which shares properties with astrocytes and gives rise to testable predictions.

## Introduction

Increasing experimental evidence in different brain areas shows that precise spike timing can be learned and reliably reproduced over trials. For example, in adult songbirds who learned to repeat the same song, HVC neurons, which are targeting the premotor area RA, reproduce precise spiking patterns during the song production (Hahnloser et al., 2002). In the rat, repeated presentation of a moving spot imprints a stereotypical spiking activity in the visual cortex that can be retrieved after learning (Xu et al., 2012). However, it remains unclear how those spiking patterns can be efficiently learned through synaptic plasticity.

Learning to autonomously reproduce a given spatiotemporal activity pattern is a fundamental challenge approached by the earliest models of recurrent neural networks (Amari, 1972; Hopfield, 1982; Herz et al., 1988; Gerstner et al., 1993; Horn et al., 2000). However, the proposed simple temporal Hebbian rule could be problematic because of its lack of robustness during recall (Hopfield, 1982). Since then, heuristic models for supervised sequence learning in recurrent networks have also been developed for spiking neurons (Ponulak and Kasiński, 2010). All these studies suffer from the same fundamental problem: the synaptic learning rule (storage) is not matched to the neural dynamics (recall) in the sense that the plasticity rule is not derived from first principles that are formulated in terms of neural dynamics.

Another limitation of existing models for sequence learning with spiking neurons (Barber and Agakov, 2002) is the restricted class of spiking patterns that can be produced with only visible neurons, i.e., neurons that receive the supervising signal. One possible solution is to include a reservoir of hidden neurons, which do not receive a supervised teaching signal, but the synaptic weights between those neurons are usually fixed (Maass and Markram, 2002). Other approaches, such as the Boltzmann machine (Ackley et al., 1985), Helmholtz machine (Dayan et al., 1995), or their extensions to the temporal domain (Hinton et al., 1995; Sutskever et al., 2009) consider learning rules for the weights toward hidden neurons, but the biological plausibility of those rules is open to discussion.

Here we study a general framework with both visible and hidden neurons. In a local neural circuit, neurons that may receive strong input (teaching signal) from outside the circuit are considered to be visible, and neurons that receive only input from neurons inside the circuit are considered to be hidden. We propose a generic synaptic learning rule that is matched to the neural dynamics and that can be adapted to a wide range of neuron models. The learning rule minimizes an upper bound on the Kullback–Leibler (KL) divergence from the target spiking distribution to the distribution produced by the network. This learning rule is consistent with spike-timing dependent plasticity (STDP; Bi and Poo, 1998). The match of recall and storage appears as an explicit link between the time constants in the learning window and the neural time constants. Furthermore, the plasticity rule for synapses that target visible neurons is consistent with the voltage-triplet rule (Clopath et al., 2010). Finally, beside the presynaptic and postsynaptic components, the learning rule for synapses that target hidden neurons is modulated by a global factor. Interestingly, this global factor shares common properties with astrocytes.

## Materials and Methods

##### Neuron model.

We consider a recurrent network of *N* spiking neurons over a duration of *T* time bins. Spiking of neuron *i* is characterized by the spike train *x _{i}*, with

*x*(

_{i}*t*) = 1 if a spike is emitted at time step

*t*, and

*x*(

_{i}*t*) = 0 otherwise. The membrane potential of neuron

*i*is described as in the spike-response model (Gerstner and Kistler, 2002): where

*w*is the synaptic strength from neuron

_{ij}*j*to neuron

*i*,

*x*

_{k}

^{α}(

*t*) = Σ

_{s=1}

^{∞}α(

*s*)

*x*(

_{k}*t*−

*s*) represents the convolution of spike train

*x*with kernel α and

_{k}*u*

_{0}is the resting potential. The postsynaptic kernel is characterized by ε(

*s*) = δ

_{s,1}for the one time step response kernel scenarios of Figures 2 and 3, whereas in Figure 4 it is given by ε(

*s*) = (exp(−

*s*/τ

_{1}) − exp(−

*s*/τ

_{2}))/(τ

_{1}− τ

_{2}) for

*s*≥ 0 and the adaptation kernel is characterized by κ(

*s*) =

*c*exp(−

*s*/τ

*) for*

_{r}*s*≥ 0, with both kernels vanishing for

*s*< 0. For the clarity of the exposition, we chose such a simple neural model description. Note, however, that almost any neural model could be considered (e.g., conductance-based models). The only constraint is that the dynamical model should be linear in the weights, i.e., any dynamical model of the form

*u̇*=

_{i}*f*(

_{i}*u*) + Σ

_{i}*(*

_{j}w_{ij}g_{ij}*u*,

_{i}*x*) is suitable.

_{j}Consistently with the stochastic spike-response model or equivalently the generalized linear model (GLM; Pillow and Latham, 2008), noise is modeled by stochastic spiking given the (noise-free) membrane potential *u* in Equation 1, i.e., the probability that neuron *i* emits a spike in time bin *t* is a function of its membrane potential: *P*(*x _{i}*(

*t*) = 1 |

*u*(

_{i}*t*)) = ρ(

*u*(

_{i}*t*)). We stress the fact that given its own membrane potential, the spiking process is conditionally independent of the spiking of all the other neurons at this time. Due to this conditional independence, the probability that the network with (recurrent) weight matrix

*w*is generating the spike trains

*x*= (

*x*

_{1}, …,

*x*) can be calculated explicitly as the product of the probabilities for each individual spike train, hence a product of factors ρ(

_{N}*u*(

_{i}*t*)) and (1 − ρ(

*u*(

_{i}*t*)), depending on whether neuron

*i*did or did not fire at time

*t*, respectively. Abbreviating ρ

*(*

_{i}*t*) = ρ(

*u*(

_{i}*t*)), this amounts for the log-likelihood (Pfister et al. (2006)) as follows: Unless mentioned otherwise, the firing probability will be assumed to be a sigmoidal ρ(

*u*) = 1/(1 + exp(−β

*u*)), with parameter β controlling the level of stochasticity. We introduced this parameter for convenience: for given weights

*w*, the stochasticity of the network can be varied by changing the parameter β, which multiplies the weights. In the limit β → ∞ each neuron acts like a threshold unit and therefore makes the network deterministic.

##### Learning task.

We separate our *N* neurons into disjoint classes of *N*_{v} visible and *N*_{h} hidden neurons. Correspondingly, the spike trains *x* are separated into those generated by the visible and hidden neurons, *x* = (*v*, *h*). We assume that learning in the recurrent network consists of adapting all the synaptic weights *w _{ij}* between and among the two types of neurons such that the KL divergence
from the target distribution

*P**(

*v*) to the model distribution

*P*(

_{w}*v*) of the spike trains of visible neurons becomes as small as possible (Fig. 1). Gradient descent would amount to change the synaptic strength proportionally to the gradient of the negative KL divergence in Equation 3, using the fact that ∂/∂

*w*log

_{ij}*P*(

_{w}*v*) = Σ

*(*

_{h}P_{w}*h*|

*v*) ∂/∂

*w*log

_{ij}*P*(

_{w}*v*,

*h*). Unfortunately, sampling the sequences of hidden states given a sequence of visible states as suggested by Equation 4,

*h*∼

*P*(

_{w}*h*|

*v*), is tricky, since a certain hidden state at time

*t*should be consistent with all visible states, in particular those at later points in time. How to select such hidden states without violating causality is unclear. To circumvent this problem, we suggest minimizing instead an upper bound of the KL divergence. In an alternative approach, using a Monte Carlo Markov Chain to sample from

*P*(

_{w}*h*|

*v*), Mishchenko and Paninski (2011) propose sampling by forward–backward algorithm from the conditional distribution of one hidden neuron spike train given all the other spike trains. It is, however, unclear how this can be implemented in biologically plausible neural networks.

##### Learning by minimizing an upper bound on the KL divergence.

To define a biologically plausible sampling procedure we make use of the fact that the firing of each neuron is independent of the activity in the other neurons given the past (cf., Eq. 2). This allows us to factorize the probability for visible and hidden spike sequences into *P _{w}*(

*v*,

*h*) =

*R*(

_{w}*v*|

*h*)

*Q*(

_{w}*h*|

*v*), where The factor

*Q*(

_{w}*h*|

*v*) describes the probability of a hidden activity pattern, when only considering the past hidden and visible activity in each time step. Note that in general

*Q*(

_{w}*h*|

*v*) ≠

*P*(

_{w}*h*|

*v*) =

*P*(

_{w}*v*,

*h*)/Σ

*(*

_{h}P_{w}*v*,

*h*), since in

*P*(

_{w}*h*|

*v*) the whole visible activity pattern (past and future) is considered in each time step. Similarly,

*R*(

_{w}*v*|

*h*) describes the probability of a visible activity pattern when considering only the past. To obtain samples

*h*∼

*Q*(

_{w}*h*|

*v*), one runs the natural dynamics for the hidden neurons (Eq. 1 including stochastic spiking) while the visible neurons are clamped to

*v*∼

*P**(

*v*). Based on this sampling procedure we introduce the upper bound of the KL divergence, To prove the inequality we note that

*P*(

_{w}*h*|

*v*) =

*R*(

_{w}*v*|

*h*)

*Q*(

_{w}*h*|

*v*)/

*P*(

_{w}*v*). Using the definition and positiveness of the KL divergence we find that 0 ≤ (

*Q*(

_{w}*h*|

*v*)‖

*P*(

_{w}*h*|

*v*)) = log

*P*(

_{w}*v*) − 〈log

*R*(

_{w}*v*|

*h*)〉

_{Qw}_{(}

_{h}_{v}_{)}and conclude that −〈log

*R*(

_{w}*v*|

*h*)〉

_{Qw}_{(}

_{h}_{v}_{)}≥ −log

*P*(

_{w}*v*). Averaging this last inequality across

*P**(

*v*) and subtracting the target entropy

*H** = −〈log

*P**(

*v*)〉

_{P}_{*(}

_{v}_{)}, we obtain Equation 6.

Note that due to the KL properties this bound is tight if and only if *Q _{w}* (

*h*|

*v*) =

*P*(

_{w}*h*|

*v*). This is the case if either the activity in the hidden neurons has no effect on the visible activity, i.e.,

*R*(

_{w}*v*|

*h*) =

*P*(

_{w}*v*) and hence

*P*(

_{w}*h*|

*v*) =

*P*(

_{w}*v*,

*h*)/

*P*(

_{w}*v*) =

*R*(

_{w}*v*|

*h*)

*Q*(

_{w}*h*|

*v*)/

*P*(

_{w}*v*) =

*Q*(

_{w}*h*|

*v*), or if the dynamics in the hidden neurons is deterministic, i.e.,

*P*(

_{w}*h*|

*v*) =

*Q*(

_{w}*h*|

*v*) = δ

_{h,h(}

_{v}_{)}for some function h(

*v*). Note that with

*N*

_{h}= (

*k*− 1)

*N*

_{v}, any factorizable Markov chain Π

*(*

_{i}P_{w}*v*(

_{i}*t*) |

*v*(

*t*− 1),…,

*v*(

*t*−

*k*)) of order

*k*, which can be parameterized by

*P*(

_{w}*v*(

_{i}*t*) |

*v*(

*t*−1), …,

*v*(

*t*−

*k*)) (cf., Eqs. 1 and 2) is implementable with deterministic dynamics in the hidden neurons: the first group of

*N*

_{v}hidden neurons are driven by the visible neurons such that their activity is the same as the visible activity one time step in the past, the second group of

*N*

_{v}hidden neurons is driven by the first hidden group such that their activity is the same as the visible activity two time steps in the past and so forth, i.e.

*h*

_{(}

_{k}_{−1)Nv+i}(

*t*) =

*v*(

_{i}*t*−

*k*) (see Fig. 3

*B*).

##### Deriving the batch learning rule.

The negative gradient of the upper bound in Equation. 6 is evaluated to the following:
where *r̄* can be any arbitrary constant since *r̄* can be chosen to minimize the variance of the second term in Equation 7. Practically, this will be approximated by *r̄* ≈ 〈log*R _{w}*(

*v*|

*h*)〉

_{Qw(h|v)P*(v)}. To justify this choice we note that the optimal value for

*r̄*can be found as follows: let

*e*=

*Q*(

_{w}*h*|

*v*), 〈

*e*〉 = 0, Var((

*r*−

*r̄*)

*e*) = 〈

*r*

^{2}

*e*

^{2}− 2

*rr̄e*

^{2}+

*r̄*

^{2}

*e*

^{2}〉 − (〈

*re*〉 − 〈

*r̄e*〉)

^{2}⇒

*r*−

*r̄*)

*e*) = − 2〈

*re*

^{2}〉 + 2

*r̄*〈

*e*

^{2}〉 = 0 ⇒

*r̄*=

*re*

^{2}〉 ≈ 〈

*e*

^{2}〉 one can approximate

*r̄*≈ 〈

*r*〉. Note that in contrast to Equation 4, the hidden spike sequences in Equation 7 can now be naturally sampled.

To deduce from Equation 7 a learning rule of the form Δ*w _{ij}* ∝ − ∂/∂

*w*we first have to distinguish between weights

_{ij}*w*projecting onto visible neurons (

_{ij}*i*= 1, …,

*N*

_{v}), which we call visible weights and weights projecting onto hidden neurons (

*i*=

*N*

_{v}+ 1, …,

*N*) − the hidden weights. Due to the conditioning,

*R*(

_{w}*v*|

*h*) does not depend on the hidden weights and

*Q*(

_{w}*h*|

*v*) does not depend on the visible weights (see Eq. 5). Hence, if the postsynaptic neuron

*i*is visible, the second term in Equation 7 vanishes, and if it is hidden, the first term vanishes. Given the explicit form of the log-likelihoods log

*R*(

_{w}*v*|

*h*) and log

*Q*(

_{w}*h*|

*v*) as in Equation 2 we can directly take the derivatives in Equation 7 and obtain the batch learning rule as follows: where η is the learning rate and

*g*(

_{i}*t*) =

*(*

_{i}*t*) =

*d*ρ(

*u*)/

*du*|

_{u=ui(t)}. For the sigmoidal transfer function ρ(

*u*) = 1/(1 + exp(−β

*u*)), which is a reasonable and common choice when dealing with binary neurons, the prefactor

*g*(

_{i}*t*) equals β, since

*d*ρ(

*u*)/

*du*= βρ(

*u*)(1 − ρ(

*u*)). In batch learning (Eq. 8) the weights are adapted only after the presentation of several spike patterns

*v*. The visible neurons are clamped with spike trains sampled from the distribution

*v*∼

*P**(

*v*), and the hidden neurons follow the neural dynamics,

*h*∼

*Q*(

_{w}*h*|

*v*). As can be seen in Equation 8, the learning rule is different for visible synapses and for hidden synapses. This stands in contrast to our previous work (Brea et al., 2011), where the learning rule is identical for both types of synapses.

In the absence of hidden neurons the learning rule is identical to the one proposed in Pfister et al. (2006) for feedforward networks. The generalization to recurrent networks with hidden neurons, which we consider here, is feasible because of the conditional independence of firing (Eqs. 2 and 5). It should be noted that, even though the learning problem was formulated as “minimize the KL divergence from target to model distribution,” this minimization can be implemented with synapses that have only access to local information (presynaptic: *x*_{j}^{ε}(*t*), postsynaptic: *g _{i}*(

*t*)(

*x*(

_{i}*t*) − ρ

_{i}(

*t*))) and in the case of hidden synapses one global signal (log

*R*(

_{w}*v*|

*h*) −

*r̄*). We will assume that each synapse has direct access to the postsynaptic spiking information

*x*(

_{i}*t*) as well as the probability of spiking ρ

*(*

_{i}*t*).

##### On-line learning rule.

To obtain an on-line learning rule, which updates the synaptic weights at each time step, we need to replace the temporal summations in the batch rule (Eq. 8) by leaky integrators. For the synapse-specific part in Equation 8 this leads to the synaptic eligibility trace
with γ_{1} = 1/*T*. Similarly, log *R _{w}*(

*v*|

*h*) is replaced by Note that log

*R*(

_{w}*v*|

*h*) is given by the expression in Equation 2 with the summation over neurons from 1 to

*N*replaced by a summation over visible neurons 1 to

*N*

_{v}. The identical time constant γ

_{1}= 1/

*T*of the leaky integrators in Equation 9 and in Equation 10 reflects the fact that in Equation 8 the terms

*g*(

_{i}*t*)(

*x*(

_{i}*t*) − ρ

_{i}(

*t*))

*x*

_{j}

^{ε}(

*t*) are summed over

*t*= 1, …,

*T*and so are the terms in log

*R*(

_{w}*v*|

*h*) (cf., Eq. 2). Finally, the term 〈log

*R*(

_{w}*v*|

*h*)〉

_{Qw}_{(}

_{h}_{|}

_{v}_{)}

_{P}_{*(}

_{v}_{)}can be estimated as follows: where the time constant γ

_{2}of this leaky integrator is much larger than γ

_{1}to estimate the average 〈log

*R*(

_{w}*v*|

*h*)〉

_{Qw}_{(h|}

_{v}_{)}

_{P}_{*(}

_{v}_{)}. With those dynamic quantities, the synaptic learning rule in Equation 8 now becomes an on-line rule of the form In Equation 9 the term

*g*(

_{i}*t*) (

*x*(

_{i}*t*) − ρ

*(*

_{i}*t*)) corresponds to the postsynaptic and

*x*

_{j}

^{ε}(

*t*) to the presynaptic contribution to the weight change. The product of these two terms is low-pass filtered (Eq. 9) to form the eligibility trace, which in the case of hidden neurons is multiplied by the global factor (

*r*(

*t*) −

*r̄*(

*t*)) (see Eq. 12). It is interesting to note that the global factor (

*r*(

*t*) −

*r̄*(

*t*)) ≃ (log

*R*(

_{w}*v*|

*h*) − 〈log

*R*(

_{w}*v*|

*h*)〉) can be seen as an “internal reward signal”, which depends on how much more than the average a given hidden activity

*h*helps to produce the visible activity

*v*. We call the signal internal, since it is provided by the visible neurons and not by an external teacher.

##### Initial condition of the dynamics.

In the derivation of the learning rule we assumed that ρ* _{i}*(

*t*) (see Eq. 2 and surrounding paragraph) is given at every moment in time

*t*. For times

*t*larger than the neural time constant, ρ

*(*

_{i}*t*) is fully determined by the spiking activity of the network. However, at the beginning of each pattern presentation or recall, ρ

_{i}(

*t*) would also be influenced by spikes that occurred earlier, i.e., before time

*t*= 1. In practice, for patterns with only visible neurons we took the spikes of the last pattern presentation or recall to initialize the dynamics. In systems with hidden units the whole system was effectively clamped to a predefined spatiotemporal pattern before each pattern presentation or recall, except in Figure 2

*E*, where the system also converged without reset to the initial state.

##### Recall.

Recall means sampling from the model distribution *P _{w}*(

*x*) and in particular that the visible activity patterns are distributed as

*P*(

_{w}*v*). Therefore, the learning rule minimizes the upper bound (

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) ≥ (

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) = 0 during recall. If the bound is tight, i.e., (

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) = (

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) = 0, the gradient ▿(

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) equals zero, because (

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) is in a local minimum. Therefore the weight change is zero on average, i.e., 〈Δ

*w*〉 = 0. However, unless

*P*(

_{w}*x*) is deterministic, the variance Var(Δ

*w*) is not zero and therefore diffusion takes place. If (

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) > (

*P*(

_{w}*v*)‖

*P*(

_{w}*v*)) an additional drift is expected to occur. For the simulations in Figure 3 we compared the time it takes to “forget,” i.e., drift away from the final value by 0.1 bits, with the time it takes to “relearn” the target. For Markovian targets (

*k*= 1) learning was ∼100 times faster than forgetting. For non-Markovian targets (

*k*> 1), where performance heavily depends on the hidden weights, learning was ∼6 times faster than forgetting.

##### Linear separability and Markovianity of sequences.

In the one time step response kernel scenario of Figure 2 with deterministic target sequences the notion of linear separability and Markovianity proves helpful for classification. Suppose the target sequence requires a neuron *i* at time *t* to be active *x _{i}*(

*t*) = 1 (silent

*x*(

_{i}*t*) = 0). This means that during recall, neuron

*i*should get a positive (negative) input at time

*t*, i.e., Σ

*(*

_{j}w_{ij}x_{j}*t*− 1) > 0 (< 0). This puts a constraint on the weights

*w*. If synaptic strengths

_{ij}*w*exist that respect these constraints for all times

_{ij}*t*and neurons

*i*, the sequence is called linearly separable. There exist many methods to test linear separability (Elizondo, 2006).

Is the proposed learning rule capable of finding the weights for a linearly separable sequence? The answer is yes. First, since we know that *w** exists, the smallest possible divergence from target to model distribution is zero (*P**(*x*)‖*P _{w}*

_{*}(

*x*)) = 0. Second, it can be shown that this divergence is convex in the synaptic weights (Barber, 2011). Therefore a suitable stochastic gradient descent will lead to a perfect solution with (

*P**(

*x*)‖

*P*(

_{w}*x*)) = 0.

We call a deterministic sequence Markovian if any state in the sequence depends only on the previous state, i.e., *x*(*t*) = *f*(*x*(*t* − 1)), for some function *f* and hence *x*(*t*) = *x*(*t*′) ⇒ *x*(*t* + 1) = *x*(*t′* + 1), ∀*t*, *t′*. Note that all linearly separable sequences are Markovian: a non-Markovian sequence contains transitions with the same initial state but different final states, i.e., for some *t* and *t′* it contains the subsequences … *x*(*t*)*x*(*t* + 1) … and … *x*(*t′*)*x*(*t*′ + 1) … with *x*(*t*) = *x*(*t′*) but *x*(*t* + 1) ≠ *x*(*t′* + 1), for which there do not exist any synaptic weights that satisfy the linear separability constraints, since for a neuron *i* for which, e.g., 1 = *x _{i}*(

*t*+ 1) ≠

*x*(

_{i}*t′*+ 1) = 0 there are no

*w*such that 0 < Σ

_{ij}*(*

_{j}w_{ij}x_{j}*t*) = Σ

*(*

_{j}w_{ij}x_{j}*t′*) < 0; hence there is no non-Markovian sequence that is linearly separable and therefore any linearly separable sequence is Markovian.

Markovian sequences that are not linearly separable require appropriate activity states in a hidden layer (Rigotti et al., 2010) such that the whole sequence of visible and hidden states becomes linearly separable. The minimal number of required hidden neurons depends on the problem. But *N*_{h} = *T* hidden neurons are always sufficient, since it is always possible to find *N*_{h} linearly separable states in *N*_{h} dimensions. Required are synaptic connections from visible to hidden and hidden to visible. In non-Markovian sequences the visible state at time *t* does not only depend on the visible activity in the previous time step *t* − 1 but depends potentially on earlier activity states, i.e., in time steps *t* − 2, …. In this case the activity in the hidden layer can be seen as representing the context in that it carries some information about past visible activity states. Consequently non-Markovian sequences require connections between hidden neurons.

##### Link to the voltage-triplet rule.

In the limit of continuous time, the learning rule for visible synapses can be written as a triplet potentiation term (2 post, 1 pre) and a depression term (1 post, 1 pre):
where *x _{i}*(

*t*) denotes the Dirac spike train of neuron

*i*, ρ′

*(*

_{i}*t*) =

*d*ρ

*(u*)/

*du*|

_{u}_{=}

_{ui}_{(}

_{t}_{)}denotes the derivative of the firing intensity function and the prefactor is defined by

*g*(

_{i}*t*) =

*t*, set the probability of spiking in one time bin to ρ

*(*

_{i}*t*)δ

*t*, thereby reinterpreting ρ

*(*

_{i}*t*) as a spike density function, and get in the limit of vanishing time bin size lim

_{δt→0}

_{+}denotes rectification, i.e., [

*x*]

_{+}=

*x*, if

*x*≥ 0, otherwise [

*x*]

_{+}= 0. The convolution kernels α, β, and γ are exponential decay kernels with time constants τ

_{α}(resp. τ

_{β}, τ

_{γ}), e.g., α(

*s*) = τ

_{α}

^{−1}exp(−

*s*/τ

_{α})Θ (

*s*) where Θ(

*s*) denotes the Heaviside function, i.e., Θ(

*s*) = 1 for

*s*≥ 0 and Θ(

*s*) = 0 otherwise.

The correspondence between our learning rule for visible synapses of Equation 13 and the voltage-triplet rule (Clopath et al., 2010) in Equation 14 becomes tighter under the following observations and assumptions. First, it should be noted that in the voltage-triplet model, the threshold is high such that [*u _{i}*(

*t*) − θ

_{2}]

_{+}is nonzero only at the timing of the spike. Therefore this term can be replaced by our Dirac spike train

*x*(

_{i}*t*). Second, we can easily assume that the response kernel ε(

*s*) matches the γ(

*s*) convolution kernel. Indeed in Clopath et al. (2010), the time constant τ

_{γ}(between 15 and 30 ms) was fitted to recordings. Third, in the limit of fast time constants τ

_{α}, τ

_{γ}→ 0 (which is reasonable since those time constants in Clopath et al., 2010 are smaller than 10 ms), the low-pass filtered versions of the membrane potential can be replaced by their instantaneous value:

*u*

_{i}

^{α}→

*u*, (resp.

_{i}*u*

_{i}

^{γ}→

*u*). Finally, if we choose a gain function given by ρ(

_{i}*u*) =

*g*

_{0}[

*u*−θ]

_{+}

^{2}+ ν

_{0}, the factor ρ′(

*u*) = 2

*g*

_{0}[

*u*−θ]

_{+}is a rectified linear function consistent with the voltage triplet model and the factor ρ′(

*u*)/ρ(

*u*) = 2

*g*

_{0}(

*u*− θ)/(

*g*

_{0}(

*u*− θ)

^{2}+ ν

_{0})Θ (

*u*− θ) is close to a rectified linear function for

*u*≪ θ +

*u*< θ and starts linearly (with a slope 2

*g*

_{0}

*v*

_{0}

^{−1}) for

*u*> θ. Note that any rectified polynomial gain function would work as well (with ρ′/ρ differentiable at

*u*= θ for power of the polynomial

*p*≥ 2).

##### Simulation details.

In all learning curve plots the measures were normalized by the number of visible neurons and time steps. If for all time steps and neurons the probability to be active is *P*(*x _{i}*(

*t*) = 1) = 0.5, which is the case for

*u*(

_{i}*t*) = 0 and the sigmoidal spike probability function, the normalized KL divergence from target to model distribution equals one bit minus the normalized entropy of the target distribution. In all simulations the initial weights were chosen to be zero, i.e.,

*w*

_{0}= 0.

In Figure 2*A*, a deterministic, linearly separable target was learned with: number of neurons *N* = *N*_{v} = 10, initial weights *w _{ij}* = 0, learning rate η = 50, parameter β = 0.2, resting potential

*u*

_{0}= 0, learning phase of 1000 target sequence presentations. In Figure 2

*B*the fraction of learnable patterns μ was estimated as the mean of the posterior

*P*(μ |

*D*(

*x*

^{(1)}), …,

*D*(

*x*

^{(l)}) ∝ Π

_{i=1}

^{l}

*P*(

*D*(

*x*

^{(i)}) | μ) with flat prior

*P*(μ) = Uniform([0, 1]), where

*D*(

*x*

^{(i)}) tells whether the sequence

*x*

^{(i)}is learnable or not (

*l*= 100). The target distributions are delta distributions

*P*

^{(}

^{i}^{)}(

*x*) = δ

_{x,x}^{(i)}, where each

*x*

^{(i)}was sampled from a uniform distribution, i.e.,

*x*

^{(i)}(

*t*) ∼

*P*(

*x*

_{j}

^{(i)}(

*t*) = 1) = 0.5. Linear separability, which is the criterion for learnability, was tested with the linear programming method provided by Mathematica (Elizondo, 2006). For the asymmetric Hebb rule,

*w*= 1/

_{ij}*T*Σ

*(2*

_{t}*x*(

_{i}*t*+ 1) − 1)(2

*x*(

_{j}*t*) − 1), we tested, whether these weights lead to perfect recall of the sequence. Note that this measure is more stringent than, for example, the average Hamming distance between target and recalled pattern.

In Figure 2*D* the same procedure as in Figure 2*B* was applied to temporally correlated target patterns. We used patterns with *N*_{v} = 100 neurons and *T* = 5 time steps. To generate patterns with correlation length α, we choose an initial state *x*(0) with *P*(*x _{i}*(0) = 1) = 0.5 and subsequent states

*x*(

*t*) with probabilities

*P*(

*x*(

_{i}*t*) =

*x*(

_{i}*t*− 1) |

*x*(

_{i}*t*− 1)) =

*P*(

*x*(α + 1) =

_{i}*x*(0) |

_{i}*x*(0)) = 0.5. Figure 2

_{i}*E*shows the attempt of learning a non-Markovian target with: number of neurons

*N*=

*N*

_{v}= 10, parameter β = 0.2, resting potential

*u*

_{0}= 0, learning rate η = 0.1; batch algorithm: number of neurons

*N*

_{v}= 10,

*N*

_{h}= 10, parameter β = 0.1, learning rate η = 0.1, estimate of

*r̄*and update of weights every 25 target pattern presentation; on-line algorithm: number of neurons

*N*

_{v}= 10,

*N*

_{h}= 10, learning rate η = 0.5, first low-pass filter time constant γ

_{1}= 1/12, second low-pass filter time constant γ

_{2}= 1/120, the low-pass filter variables

*e*,

_{ij}*r*,

*r̄*were initialized with zeros. During the first 100 pattern presentations no update of the hidden weights took place to omit transient effects of the low-pass filter dynamics. In the batch simulation the state of the hidden neurons was reset to a given state

*h*

_{1}(indicated by the green line) after each target pattern presentation and before each recall. In the on-line simulation the hidden states were never reset. In all simulations the initial weights were set to zero, i.e.,

*w*= 0, the learning phase consisted of 2.5 · 10

_{ij}^{4}target pattern presentations.

Figure 3*A* shows target distribution *P**(*v*) = *P _{w}*

_{*}(

*v*) with

*w**

*drawn from a normal distribution with mean zero and SD 5, number of visible neurons*

_{ij}*N*

_{v}= 5, number of hidden neurons

*N*

_{h}= 0 for target and model with only visible,

*N*

_{h}= 15 for model with hidden neurons, parameter β = 2/

_{v}= 4 · 10

^{−4}, learning rate for hidden weights η

_{h}= 10

^{−4}, resting potential

*u*

_{0}= 0, initial model weights

*w*

_{0}= 0,

*P*(

*x*

_{0}) = δ

_{x0,x*}. Training data consisted of 10

^{7}patterns of length 20 time steps, which were generated by the target distribution by running the neural dynamics with target weights

*w**. The transition frequency data was obtained empirically by generating 10

^{4}samples of 20 time steps.

Figure 3*B* shows the same as in Figure 3*A* but for a different target. The construction of the *k* = 3 Markov target is explained in the second paragraph after Equation 6 and in the caption of Figure 3*B*. We excluded target weights *w**, which parameterize a distribution with highly correlated subsequent visible states, i.e., 〈(2*v _{i}*(

*t*+ 1) − 1)(2

*v*(

_{i}*t*) − 1)〉

_{t,i,Pθ}_{(}

_{x}_{)}> 0.8, since such distributions can be accurately approximated with a Markovian model.

Figure 3*C* shows the KL divergence after learning for different *k*. Results were obtained for 16 different target weight matrices *w** and initial conditions *x**. For simulations with static hidden weights the visible to hidden and hidden to hidden weights were the randomly reordered final weights from the corresponding simulation with plastic hidden weights, the initial visible weights were set to zero and trained with the same target.

In Figures 4 and 5 we identify one simulation time step with 1 ms. Figure 4*A*. Number of neurons *N* = *N*_{v} = 100, response kernel time constants τ_{1} = 10 ms, τ_{2} = 2 ms, adaptation time constant τ* _{r}* = 15 ms, adaptation kernel amplitude

*c*= −10, parameter β = 1/3, learning rate η = 0.5, resting potential

*u*

_{0}= −5, initial model weights zero, target weights

*w**

*= 20, if*

_{ij}*i*∈ {10(

*l*+ 1) + 1, … 10(

*l*+ 2)} and

*j*∈ {10

*l*+ 1, … 10(

*l*+ 1)},

*w**

*= −5, otherwise. Training data consisted of 10*

_{ij}^{6}states generated by the target distribution by running the neural dynamics with target weights

*w**.

Figure 4*B* shows the number of neurons *N* = *N*_{v} = 20, response kernel time constants τ_{1} = 10 ms, τ_{2} = 2 ms, adaptation time constant τ* _{r}* = 3 ms, adaptation kernel amplitude

*c*

_{adapt}= −200, parameter β = 0.1, learning rate η = 100, resting potential

*u*

_{0}= −5, initial model weights zero, delta target distribution

*P**(

*x*) = δ

_{x,x*}, with

*x** drawn from a distribution with

*P*(x*

*(*

_{i}*t*) = 1) = 0.1, the learning phase consisted of 2 · 10

^{4}target pattern presentations. For recall, the system is always initialized in the same way.

Figure 4*C* shows the number of neurons *N* = *N*_{v} = 10, fast response kernel kernel time constants τ_{1} = 15 ms, τ_{2} = 7 ms, slow response kernel time constants τ_{1} = 50 ms, τ_{2} = 30 ms, adaptation kernel κ(*t*) = *c*_{f}^{−t/τf} + *c*_{s}^{−t/τs}, with *c _{f}* = −50,

*c*= −5, τ

_{s}*= 7 ms, τ*

_{f}*= 30 ms, the slow decay with time constant τ*

_{s}*was added to counteract the influence of the slow response kernel and thus increase the stability during recall, parameter β = 0.2, learning rate η = 10,*

_{s}*u*

_{0}= −5, and initial model weights zero. Target spike trains were generated by injecting strong external positive currents in high firing rate phases and negative currents in low firing rate phases, which we modeled using neural dynamics given by

*u*(

_{i}*t*) =

*u*

_{0}+

*x*

_{i}

^{κ}(

*t*) +

*u*

_{ext}, where

*u*

_{ext}= +50(−50) in high (low) firing rate phases of the neuron and the sigmoidal spiking firing probability. Each high (low) firing rate phase had a duration of 50 ms. The training phase consisted of 3 · 10

^{3}samples drawn from the target distribution.

In Figure 5*A* the change in synaptic strength Δ*w* after 50 pairings of one presynaptic with one postsynaptic spike on an interval of 200 ms was computed for different adaption kernels κ(*t*) = *ce*^{−t/τr} with τ* _{r}* = 10 ms (black:

*c*= 2, blue:

*c*= 0, red:

*c*= −10) while keeping the synaptic response kernel ε(

*t*) =

*e*

^{−t/τ1}−

*e*

^{−t/τ2}), with τ

_{1}= 10 ms and τ

_{2}= 2 ms. The initial value of the membrane potential before stimulation was set to the resting potential

*u*

_{0}= −5. Initial weight was

*w*= 10. In Figure 5

*B*the data were fitted with a model given by the probability function ρ(

*u*) =

*g*

_{0}[

*u*− θ]

_{+}

^{2}+ ν

_{0}, and response kernel α(

*s*) = τ

_{α}

^{−1}exp(−

*s*/τ

_{α})Θ(

*s*) (see above, Link to the voltage-triplet rule). The fitted parameters are given by the following:

*g*

_{0}= 0.94 · 10

^{−2}Hz, ν

_{0}= 7.9 Hz, θ = −36.2, τ

_{α}= 41.5 ms,

*w*

_{0}= 1, and η = 0.46.

All simulations were performed with Mathematica on personal computers, except the simulations in Figure 3, which were written in C and the fit in Figure 5*B*, which was performed with MATLAB.

## Results

We studied the task of learning to spontaneously produce spiking activity with a given statistics: stimuli make neurons fire in a specific order (Fig. 1*A*, target); in the absence of any stimulus, neurons spike spontaneously; and before learning, the spontaneous activity patterns do not resemble the stimulus-evoked activity patterns (Fig. 1*A*, before learning). The goal of learning is to change the network dynamics such that after learning the spontaneous activity patterns resemble the stimulus-evoked activity patterns (Fig. 1*A*, after learning). To derive plasticity rules that solve this learning task, we chose the neural dynamics and minimized with respect to synaptic weights a distance measure between stimulus-evoked and spontaneous activity (see Materials and Methods; Fig. 1*B*). In this way, the learning rule is matched to the neural dynamics.

To study the proposed learning rules we first elaborate on deterministic target sequences learned with the simplest variant of our model. Deterministic target sequences are of behavioral relevance, since spatiotemporal spike pattern distributions are presumably often sharply peaked in the sense that the typical spike patterns closely resemble each other.

### Deterministic target sequences

In the simplest conceivable form of our model the shape of the postsynaptic kernel ε is a unit rectangular for one delayed time step and no adaptation takes place. This stochastic McCulloch-Pitts neuron (McCulloch and Pitts, 1943) is of widespread usage in artificial neural networks (Baum and Wilczek, 1987; Düring et al., 1998; Barber, 2011).

In Figure 2*A*, target, we show a sequence, which is learnable with only visible neurons. During learning, the stimuli activated the neurons repetitively in the order shown in the figure. After learning, spontaneous activity reproduced the target sequence including the transition from the last to the first spatial activity pattern. This fact is also reflected in the learning curve in Figure 2*A*, which shows that the model distribution approximated the target distribution almost perfectly.

Which sequences are learnable with only visible neurons and which require hidden neurons? Since at each moment in time a neuron can be either active or silent, the total number of visible sequences is 2^{N}_{v}* ^{T}*, for a given number of visible neurons

*N*

_{v}and sequence length

*T*. We can group these sequences using as criterion either linear separability or Markovianity (see Materials and Methods). It turns out that this grouping helps to formulate minimal architectural requirements, as summarized in Figure 2

*C*: linearly separable sequences can be learned with only visible neurons and synaptic connections between them, Markovian sequences require enough (at most

*N*

_{h}=

*T*) hidden neurons and at least synaptic connections from visible to hidden and hidden to visible, and non-Markovian sequences demand additionally connections between hidden neurons.

The sequence shown in Figure 2*A* is linearly separable and thus can be learned with only visible neurons. The sequence in Figure 2*E* is non-Markovian: the state marked by a blue triangle occurs twice in the sequence. Furthermore, already the second state (red triangle), where all neurons are silent, renders the sequences not linearly separable. This is a consequence of our choice of coding (1, active; 0, silent) and adaptable parameters (only synaptic weights *w _{ij}*), which we used to account for biological constraints. This sequence can only be learned with appropriate recurrent connections of the hidden neurons. We show that both the batch and the on-line learning rule find appropriate synaptic weights by stochastic gradient descent on the upper bound of the divergence from target to model distribution. This bound becomes tight, i.e., = , toward the end of learning (see Materials and Methods), which is reflected in the diminishing distance between the and curves in Figure 2

*E*.

What is the typical size of the subset of linearly separable, and thus learnable, sequences for given *N*_{v} and *T*? In Figure 2*B* we show the fraction of learnable sequences as a function of the relative sequence length *T*/*N*_{v} for different *N*_{v}. Below a relative sequence length of approximately *T*/*N*_{v} ≈ 1.5 the probability for the sequence to be linearly separable is very close to 1. The critical value *T*/*N*_{v} ≈ 1.5 is again a consequence of our choice of coding. For (1, active; −1, silent) coding we expect the critical relative sequence length to be at *T*/*N*_{v} ≈ 2 (Hertz et al., 1991). As a reference we compare this to a simple temporal Hebb rule Δ*w _{ij}* =

_{t=1}

^{T}(2

*x*(

_{i}*t*+ 1) − 1)(2

*x*(

_{j}*t*) − 1) as proposed in Hopfield (1982); Düring et al. (1998); Grossberg (1969); Herz et al. (1988); Gerstner et al. (1993); Horn et al. (2000). It is not surprising that this simple temporal Hebb rule, for which linear separability is not a sufficient condition for a pattern to be learnable, in general does not find synaptic weights to perform perfect recall: instead of solving an optimization problem (minimizing the divergence from target to model distribution), which leads to a learning rule that is matched to the dynamics, it is simply inspired by the symmetric Hebb rule used in Hopfield networks (Hopfield, 1982). Its weakness becomes even clearer when we consider correlated patterns (Fig. 2

*D*): the simple temporal Hebb rule performs well only for uncorrelated patterns whereas the probability of linearly separable sequences decreases slowly with increasing correlation.

### Stochastic target

Some stimuli might evoke spike patterns that do not follow a sharply peaked distribution. Instead, the spike patterns look quite different each time a sample from the target distribution *P**(*x*) is drawn. Even though they do not look similar, there might be some temporal dependencies in the sense that the spiking activity *x*(*t*) at time *t* depends on earlier spiking activity states *x*(*t* − *k*), *k* > 0. Variability in the target patterns can arise due to many different reasons. First, there is an extrinsic source of variability, as the same stimulus can be presented with small variations. Second, neurons and synapses are intrinsically noisy, so the neural network that provides the teaching signal will be subject to noise. Interestingly, in the proposed framework the level of noise during recall can be matched to the target variability through synaptic plasticity.

As we showed in Materials and Methods (second paragraph after Eq. 6) it is possible to implement a large class of stochastic target distributions, namely factorizable Markov chains product *P _{w}*(

*v*(

_{i}*t*) |

*v*(

*t*− 1), …,

*v*(

*t*−

*k*)) of order

*k*, which can be parameterized by

*P*(

_{w}*v*(

_{i}*t*) |

*v*(

*t*− 1), …,

*v*(

*t*−

*k*)) (cf., Eqs. 1 and 2). For Markovian targets (

*k*= 1) and a model without hidden neurons, the solution is unique due to the convexity of the KL divergence. In Figure 3

*A*we demonstrate in an example that the learning rule effectively leads to the solution. For

*k*> 1 a model with only visible neurons fails to learn the target, whereas a model with sufficient hidden neurons equipped with our learning rule succeeds (Fig. 3

*B*). Under the light of the remarkable capabilities of reservoir computing (Maass and Sontag, 2000; Lukoševičius and Jaeger, 2009; Rigotti et al., 2010) it is interesting to note that a random choice of static hidden weights obtained by reshuffling learned weights is not sufficient to learn the target (Fig. 3

*C*).

### α-shaped kernels

So far we discussed the plasticity rule for a model with very simple dynamics. However, the neuron model in Equation 1 allows for more realistic dynamics. In the examples shown in Figure 4, a presynaptic spike evokes an α-shaped synaptic response ε (Fig. 4*B*), which is felt by the postsynaptic neuron, and an adaptation κ, which prevents immediate follow-up spikes in the presynaptic cell. With this dynamics it is possible to learn a “synfire chain” (Fig. 4*A*) or a pattern of precise spike times (Fig. 4*B*). Depending on the target distribution and the neural dynamics, timing errors may propagate and the pattern eventually becomes unstable (Diesmann et al., 1999). This fading of precision is less problematic with a rate code.

In Figure 4*C*, target, the pattern of Figure 2*A* is encoded with periods of high and low firing rate. The system can reasonably well learn this target with an α-shaped synaptic response ε_{1} and adaptation kernel κ (Fig. 4*C*, model with only ε_{1}, blue learning curve). Fast synaptic responses ε_{2} help to further stabilize the pattern and increase performance (Fig. 4*C*, model with both ε_{1} and ε_{2}, red learning curve). The motivation for using synaptic responses on two different timescales arises from the idea that fast connections establish attractors that encode for the different elements in the sequence and slow responses push the dynamics from one attractor to the next (Sompolinsky and Kanter, 1986). Neurons could implement fast and slow responses to one presynaptic spike with different types of synapses, e.g., fast AMPA synapses and slow NMDA synapses.

### Biological plausibility

When learning is formulated with a computational goal in mind, like minimizing the difference between target and model distribution, it is far from obvious that the resulting learning rule is biologically plausible in the sense that it respects constraints of biological systems and is consistent with experimental data. Here we argue that the proposed learning rule in Equation 12 is biologically plausible.

The learning rule in Equation 12 respects causality, is implemented in an on-line fashion, and depends on presynaptic and postsynaptic *x* activity and a modulatory factor for hidden synapses. The pre-post term shares similarities with STDP: Figure 5*A* shows the STDP window predicted by the learning rule (Bi and Poo, 1998; Pfister et al., 2006; Brea et al., 2011). Note that this learning rule, which minimizes the divergence in Equation 3 from target to model, predicts that the causal part of the STDP window should be matched to the dynamics of the synaptic transmission. The acausal part depends on the adaptation properties of the neuron. Interestingly, the learning rule is closely related to the voltage-triplet rule (Pfister and Gerstner, 2006; Clopath et al., 2010; see Materials and Methods) and is compatible with the frequency dependence of STDP as observed by Sjöström et al. (2001) (Fig. 5*B*).

For hidden synapses, the plasticity rule does not depend only on the presynaptic and postsynaptic activity, but is also modulated by a global factor. This global factor could be implemented by astrocytes for mainly three reasons. First, it has been shown that astrocytes, which contact a large number of synapses, can modulate synaptic plasticity (Henneberger et al., 2010; Min and Nevian, 2012). Similarly, in our framework, the semi-global modulating factor (see Eq. 12) affects the plasticity of a large group of synapses (the hidden synapses). Second, according to our model, this global factor (see Eq. 12) acts at a slower time constant than the membrane time constant, which is consistent with the calcium dynamics of astrocytes (Di Castro et al., 2011). Finally, a causal pairing (pre-then-post) can lead either to long-term potentiation or long-term depression depending on the sign of the global factor (*r*(*t*) − *r̄*(*t*)). This is in agreement with the study of Panatier et al. (2006) who showed that when d-serine, which is an NMDA co-agonist, is released by astrocytes, long-term depression can be turned into long-term potentiation. Unlike what is suggested by Panatier et al. (2006), this type of sliding threshold (*r̄* (*t*) in our case) is conceptually different from the one in the BCM learning rule (Bienenstock et al., 1982) since it is a global quantity whereas in the BCM learning rule the sliding threshold is a local quantity that depends on the history of the postsynaptic activity. Note that the proposed global factor depends on activity in visible neurons and affects hidden synapses only. Thus, if astrocytes are responsible for this signaling, they “need to know” which neurons are visible and which are hidden. This might be possible due to geometrical arrangement or chemical signaling.

## Discussion

Learning a statistical model for temporal sequences with hidden units is a challenging machine learning task in itself. Here we considered an even harder problem in that we are interested in a biologically plausible learning rule that can solve this task. To be biologically plausible, the learning rule has to be local, causal, and be consistent with experimental data. We derived a biologically plausible learning rule that minimizes by stochastic gradient descent an upper bound of the KL divergence from target to model distribution.

Because the proposed learning rule is minimizing an upper bound of the KL divergence and not the KL divergence itself and because the tightness of the bound is not explicitly controlled, unlike in the Helmholtz framework (Dayan et al., 1995), the maxima of the two different objective functions could in principle be located at different places in the parameter space. It is, however, interesting to note that after some learning time, the bound becomes tighter, as can be seen on Figure 2*E* where the KL divergence between the proposal distribution *Q _{w}*(

*h*|

*v*) and the posterior distribution

*P*(

_{w}*h*|

*v*) decreases. In fact, we can show that at the beginning of learning and at the end of learning of deterministic patterns, the bound is tight. Indeed, at the beginning of the learning all the weights (and in particular the weights toward visible neurons) are initialized to zero and therefore, the function

*R*(

_{w}*v*|

*h*) is independent of the hidden activity and thus and are identical. At the end of learning, for a deterministic pattern, the proposal distribution is a Dirac delta distribution and therefore = .

It is interesting to note that the learning rule for the hidden neurons in Equation 12 appears formally similar to a reward-based learning rule (Pfister et al., 2006), but now the reward is an internally computed quantity and does not depend on an external reward. Loosely speaking, if the hidden units contribute to make the visible spikes likely, the synapses targeting those hidden units receive an internal reward signal (*r*(*t*) − *r̄*(*t*)) (Eq. 12).

The formulation of learning as minimizing the KL divergence (or, equivalently, maximizing the log-likelihood) or an upper bound thereof is common practice in machine learning. The novelty of our approach lies in the specific choice of the upper bound of the KL divergence, which relies on the assumption of conditional independence for the neural firing, i.e., given its membrane potential, the probability of firing of a neuron is conditionally independent of the firing of all other neurons at the same time. Even though this assumption is perfectly reasonable and widely used (e.g., in the GLM framework; Pillow and Latham, 2008), it is the key assumption that allows the joint distribution over visible and hidden activity to be expressed as a product of a distribution from which it is easy to sample (*Q _{w}*(

*h*|

*v*)) and a function which is easy to calculate (

*R*(

_{w}*v*|

*h*)). This stands in contrast to temporal Boltzmann machines (Sutskever et al., 2009) where this assumption is not made and sampling usually involves running a Monte Carlo Markov Chain in each time step, which is hard to justify under the light of biological plausibility.

Another approach to learn a statistical model of spatiotemporal spike pattern with visible and hidden neurons is a generalization of the expectation-maximization algorithm proposed by Pillow and Latham (2008). Yet, as a version of the Helmholtz machine, the distribution, from which the hidden states are sampled, uses a different parameterization for storage (recognition; wake phase of learning) and recall (generation; sleep phase of learning) and assumes an acausal kernel, which renders the model unsuitable for a biologically realistic implementation. To circumvent the need of sampling the hidden states, *h* ∼ *P _{w}*(

*h*|

*v*) (see Eq. 4), Rezende et al. (2011) proposed to calculate explicitly the expectation under the posterior distribution

*P*(

_{w}*h*|

*v*). To achieve this, they had to assume a weak coupling between neurons and then approximate the true posterior distribution by a Gaussian process on the membrane potential. This weak coupling assumption is however difficult to justify at the end of learning where individual weights can become large.

A drawback of the proposed learning rule for systems including hidden neurons is the potentially long learning time. The plasticity rule relies on stochastic gradient ascent by sampling hidden sequences *h* ∼ *Q _{w}*(

*h*|

*v*). As in any gradient descent method, the learning time depends on the learning rate and the initial condition: the learning time is short if at the beginning of learning the weights projecting onto hidden neurons are such that the samples

*h*∼

*Q*(

_{w}*h*|

*v*) help to quickly reduce the difference measure between target and model distribution and can be long otherwise. In some cases it is even possible to find initial weights that supersede any further learning of hidden weights. For example, any possible sequence as discussed in Figure 2

*C*could be learned with a (hardwired) delay line of length

*T*with

*N*

_{h}=

*T*hidden neurons and no learning of the hidden weights. Alternatively, the delay line could be implemented by a large enough number of randomly connected hidden neurons (Maass and Sontag, 2000; Lukoševičius and Jaeger, 2009; Rigotti et al., 2010) where the weights are chosen from a given distribution. The number of randomly connected hidden neurons needed might be very large to guarantee good solutions in any case. Since the goal of this paper was rather to demonstrate that the learning rule is capable of learning distributions that could not be learned with only visible neurons, we did not make use of elaborated choices of the initial conditions and started all simulations with initial weights

*w*

_{0}= 0.

Our paper is not the first one to propose the idea that storage and recall should be matched. Indeed Sommer and Dayan (1998) and Lengyel et al. (2005) already proposed that there should be a tight link between the plasticity rule and the neural dynamics. Interestingly their approach is complementary to the one presented here. Indeed, they start from a given plasticity rule and then derive the optimal (Bayesian) recall dynamics for this given plasticity rule. Here we are following the opposite path. We start from the neural dynamics and then derive the plasticity rule. Given the richness and the accuracy of existing neural models and given the absence of a canonical model of synaptic plasticity, we preferred to start from what is well known and derive predictions on what is largely unknown.

An interesting outcome of our model is that the learning rule for hidden synapses does not depend only on the presynaptic and postsynaptic activity, but is also modulated by a global factor. We argued in this paper that this global factor could be provided by astrocytes. To experimentally test this hypothesis we note that the global factor is predicted to depend on the voltage of the visible neurons. In particular, independently of the precise implementation of the model (be it by minimizing an upper bound on the KL divergence as presented here or by directly minimizing KL divergence itself as in Brea et al., 2011), the global factor crucially depends on the presynaptic membrane potential at the time of the spike (see Eq. 10). So the key experimental step would be to show that astrocytic activity depends on the membrane potential of the presynaptic neuron at the time of the spike. This prediction seems plausible since it was found that a depolarization of the presynaptic membrane potential at the time of the spike causes a larger postsynaptic potential (Alle and Geiger, 2006; Shu et al., 2006). We suggest that astrocytes could have a similar sensitivity to the membrane potential at the time of the spike.

## Footnotes

This research is supported by the Swiss National Science Foundation (J.B. and W.S., Grant 31003A_133094; J.-P.P., Grant PZ00P3_137200) and a grant from the Swiss SystemsX.ch initiative (Neurochoice). We gratefully acknowledge discussions with Robert Urbanczik and Danilo Jimenez-Rezende.

- Correspondence should be addressed to Johanni Brea, Department of Physiology, University of Bern, Bühlplatz 5, CH-3012 Bern, Switzerland. johannibrea{at}gmail.com