# Estimating Allele Age and Selection Coefficient from Time-Serial Data

- Anna-Sapfo Malaspinas*,
^{1}, - Orestis Malaspinas
^{†}^{‡}, - Steven N. Evans
^{§}and - Montgomery Slatkin**

^{*}Centre for Geogenetics, Natural History Museum of Denmark, University of Copenhagen, 1350 Copenhagen, Denmark^{†}Institut Jean le Rond d’Alembert, Université Pierre et Marie Curie, F-75252 Paris cedex 5, France^{‡}Computational Science Department, Centre Universitaire d’Informatique, Université de Genève, 1227 Carouge, Switzerland^{§}Department of Statistics, University of California, Berkeley California 94720-3860, and^{**}Department of Integrative Biology, University of California, Berkeley, California 94720-3140

- 1Corresponding author: Centre for Geogenetics, Natural History Museum of Denmark, Øster Voldgade 5-7, 1350 Copenhagen K, Denmark. E-mail: anna.sapfo.malaspinas{at}snm.ku.dk

## Abstract

Recent advances in sequencing technologies have made available an ever-increasing amount of ancient genomic data. In particular, it is now possible to target specific single nucleotide polymorphisms in several samples at different time points. Such time-series data are also available in the context of experimental or viral evolution. Time-series data should allow for a more precise inference of population genetic parameters and to test hypotheses about the recent action of natural selection. In this manuscript, we develop a likelihood method to jointly estimate the selection coefficient and the age of an allele from time-serial data. Our method can be used for allele frequencies sampled from a single diallelic locus. The transition probabilities are calculated by approximating the standard diffusion equation of the Wright–Fisher model with a one-step process. We show that our method produces unbiased estimates. The accuracy of the method is tested via simulations. Finally, the utility of the method is illustrated with an application to several loci encoding coat color in horses, a pattern that has previously been linked with domestication. Importantly, given our ability to estimate the age of the allele, it is possible to gain traction on the important problem of distinguishing selection on new mutations from selection on standing variation. In this coat color example for instance, we estimate the age of this allele, which is found to predate domestication.

- iallele age
- ancient DNA
- population genetics
- selection
- time-serial data

TIME-series analysis is widespread in several fields, such as meteorology, economics, and physics (Hamilton 1994) with the relation being statistical models designed to deal with a time-ordered sequence of observations. Such observations are also prevalent in several areas of biology. Until recently, however, time-series molecular data were only available for time spanning a few generations in higher organisms. Therefore, in the context of population genetics, time-serial data were mostly limited to viral or experimental evolution (Wichman *et al.* 2005; Bollback and Huelsenbeck 2007; Nelson and Holmes 2007; Gresham *et al.* 2008).

However, with recent advances in DNA sequencing and DNA preparation techniques, the study of extinct and long dead organisms is now entering a new era, an era in which time-sampled measurements spanning hundreds or thousands of generations for even mammalian species may be obtained. For example, while previous studies were limited to short segments of mitochondrial DNA, whole nuclear genomes are now available from several ancient samples (Rasmussen *et al.* 2010; Reich *et al.* 2010), and it is now additionally possible to target specific DNA regions in ancient organisms (Lalueza-Fox *et al.* 2007; Ludwig *et al.* 2009; Rusk 2009). Therefore, time-serial data will become increasingly available for a whole range of organisms allowing one to test evolutionary questions using not only present day samples, but also samples from extinct populations.

The relevant theory to describe such temporal changes in allele frequency has existed since the advent of population genetics (Fisher 1922; Wright 1931). Although not very common, several statistical methods and estimators to deal with time-serial data have been developed and applied to, for example, estimate historical changes in population size (Waples 1989; Williamson and Slatkin 1999; Anderson *et al.* 2000; Drummond and Rambaut 2007). More recently, in 2008, Bollback *et al.* developed a method to coestimate the effective population size, *N*_{e}, and the selection coefficient, *s*, from temporal allele frequency data. They model the evolution of the allele frequency of a diallelic locus with a diffusion process that approximates a Wright–Fisher population genetic model (WF), under the assumption that the locus is under constant natural selection acting on diploid individuals.

Our work is a natural extension of Bollback *et al.*’s (2008) method to also allow for the estimation of the allele age (*i.e.*, the time since the mutational event), *t*_{0}. Allele age is an omnipresent parameter in population genetics and along with the selection coefficient it plays a crucial role in determining the sojourn time of a beneficial mutation (see Slatkin and Rannala 2000 for a review). Additionally, given the recent focus on the important question of distinguishing between models of selection on new *vs.* standing mutations—a phenomenon that speaks to the fundamental mode and tempo of the process of adaptation—the ability to estimate the time of a mutational event is of paramount importance (see review of Barrett and Schluter 2008).

Our extension allows these competing models to be addressed, unlike the previous work of Bollback *et al.* (2008) that assumed that at the first time of sampling the population allele frequency was uniformly distributed. It follows from this latter assumption that even if the allele was not sampled at the oldest sampling time, it had to be present in the population. Here, we present an approach to coestimate *s*, *N*_{e}, and *t*_{0} by computing the likelihood of these parameters for a suitable model.

In the *Theory* section we explain how we approximate the WF model with a one-step process. We then discuss the numerical details of the implementation in both *Numerics* sections. We show how our method performs on the basis of simulations in both S*imulations* sections. We analyze a data set of horses for the *ASIP* locus for samples dating from the Pleistocene up to the present in both *Real data* sections. We conclude and offer some further perspectives in *Conclusion*.

## Materials and Methods

### Theory

We assume that there is a single, panmictic population evolving according to a WF population genetic model. Under this model, the frequency of an allele *A* is a homogeneous discrete-time Markov chain. We denote the Markov chain describing the frequency of the allele *A* through time by *X*_{t}. We assume that selection is constant from the time the allele arose up to present. The allele under selection arises only once and there is no recurrent mutation. In other words, the only evolutionary forces acting on that allele are genetic drift and selection.

Selection is modeled as acting on diploid individuals. If we denote the two alleles by *A* and *a*, we can choose the genotypic fitness to be *w _{AA}* = 1 +

*s*,

*w*= 1 +

_{Aa}*sh*, and

*w*= 1, where

_{aa}*s*is the selection coefficient and

*h*is the dominance coefficient (

*s*> −1 and

*h*ε [0, 1]; see,

*e.g.*, Ewens 2004). If

*N*

_{e}is the effective population size, the states of

*X*

_{t}are the allelic frequencies, with respect to the population size

*j*≤ 2

*N*

_{e}. Therefore, the state space is

*γ*= 2

*N*

_{e}

*s*.

We compute the likelihood of the allele age *t*_{0}, the rescaled selection coefficient *γ*, and the effective population size *N*_{e}. To simplify the notation, define *θ* ≡ (*γ*, *N*_{e}, *t*_{0}) the parameters of interest. Assume that we have samples from *m* distinct sampling time points. We suppose that *M* = (*n*_{1}, *n*_{2}, …, *n _{m}*) chromosomes were collected, among which

*I*= (

*i*

_{1},

*i*

_{2}, …,

*i*) are of the

_{m}*A*type, and that the chromosomes were drawn at times

*T*= (

*t*

_{1},

*t*

_{2}, …,

*t*), where time is measured in generations with

_{m}*t*

_{k}_{−1}<

*t*(see Figure 1). The likelihood function of the parameters, for a given

_{k}*M*and

*h*, is ℓ(

*θ*) =

*p*(

*i*

_{1}, …,

*i*|

_{m}*θ*,

*T*).

To compute the likelihood, we can condition and sum over all the population allelic frequencies, *x _{j}*

_{1}, …,

*x*, at each sampling time

_{jm}*t*

_{1},

*t*

_{2}, …,

*t*. We can then rewrite the likelihood:

_{m}*A*alleles

*i*at each sampling time are independent of one another. The first term of the summation of Equation 1 becomes

_{j}*k*ε {0, ..,

*m*}

*X*is a Markov chain, the second term of the summation of Equation 1 is given by

_{t}*x*

_{j}_{0}is the frequency of the allele when it first arose in the population,

*i.e.*,

*X*

_{t}

*θ*and

*T*. By substituting Equation 2 and 4 into 1 we obtain

The solution for the transition probabilities for the nonneutral case of the WF model is elaborate (Ewens 2004 and citations therein, but see also Song and Steinrucken 2012). Rescaling the time by 2*N*_{e}, the Markov chain, *X*_{t}, can be approximated by a diffusion process (“WF diffusion process”), *Y _{τ}* (see,

*e.g.*, Durrett 2008). Time is now in units of 2

*N*

_{e}generations and is continuous and we replace

*T*by

*y*= (

*τ*

_{1}, …,

*τ*), where

_{m}*y*ε [0, 1]. This holds in the limit of large

*N*

_{e}, where

*Z*(see,

_{τ}*e.g.*, Van Kampen 1992). A one-step process is a continuous-time Markov chain (

*i.e.*, discrete in space and continuous in time) where jumps are only allowed between two states that are adjacent to each other. As before, the states of the process

*Z*are certain population allelic frequencies that we denote by {

_{τ}*z*

_{0},

*z*

_{1}, …,

*z*

_{H}_{−1}}, where

*H*is an integer. The states are chosen such that

*z*

_{0}and

*z*

_{H}_{−1}are, respectively, the 0 and 1 allelic frequencies. These are absorbing states since there is no recurrent mutation. The other states are chosen such that 0 <

*z*< 1 and

_{k}*z*

_{k}_{−1}<

*z*for 0 <

_{k}*k*<

*H*− 1. The infinitesimal generator

*Q*of such a process is a tridiagonal

*H*×

*H*matrix. By denoting

*β*(respectively,

_{i}*δ*) the rate of jumping to the right (respectively, the left) of state

_{i}*i*, we have that

*η*= −(

_{k}*β*+

_{k}*δ*). The transition probability between two states

_{k}*z*of the process is

_{jk}*β*and

_{i}*δ*(see Supporting Information, File S1.1), one can show that for large

_{i}*H*,

*Z*≃

_{τ}*Y*. In particular,

_{τ}*β*and

_{i}*δ*will be functions of

_{i}*z*,

_{j}*z*

_{j}_{−1},

*z*

_{j}_{+1},

*γ*and

*h*. Note that

*Y*is a continuous variable whereas

_{τ}*Z*is discrete. Therefore, choosing

_{τ}*Y*has a continuous-state space and

_{τ}*Z*has a discrete-state space. We can approximate the likelihood described in Equation 5 by replacing the original process

_{τ}*X*

_{t}by the one-step process

*Z*. We then have

_{τ}In the case of experimental evolution this unconditional process should be realistic since in principle one might want to estimate the selection coefficient for any locus. We now consider one special case of what is presented above, motivated by ancient DNA data. We assume that the allele is segregating at the last sampling time (*i.e.*, the process has not reached states 0 or 1). This case corresponds to what we think is a realistic scenario for how ancient DNA data would be collected, where presumably the locus of interest is polymorphic at present. Indeed, only such loci would be selected for inference.

We can rewrite the resulting likelihood as*z*_{1}, …, *z _{H}*

_{−2}} ⊂ {

*z*

_{0},

*z*

_{1}…

*z*

_{H}_{−2},

*z*

_{H}_{−1}}. The infinitesimal generator

*q*of such a process is the matrix

^{C}*Q*without the first and last rows and columns,

*i.e.*,

Finally, to compute the likelihood of Equations 8 and 9, all that remains is to compute the matrix exponentiation *e ^{Qτ}* and

### Numerics

We evaluate numerically the matrix exponentiation. The advantage of the current approach compared to Bollback *et al.*’s is that we do not need to do a numerical integration step since the state space is already finite. The description of the matrix exponentiation is given in File S1.2.

Although asymptotically the one-step process is equivalent to the WF model, since the state space of *Z _{τ}* has a finite number of states, the accuracy of the approximation depends on the choice of the states, or what we call from now on the “grid.” We investigate three grids strongly inspired by Gutenkunst

*et al.*(2009). The first one is a uniform grid with a point added at

Since the likelihood function is complex, we were not able to compute the maximum of the function analytically. Therefore, to find the maximum, we first computed the likelihood over a large range of parameters. We verified that there is a single maximum for each time interval defined by adjacent sampling times, *i.e.*, if *t*_{0} < *t*_{1}, the time intervals are (−∞, *t*_{0}), (*t*_{1}, *t*_{2}), …, (*t _{m}*

_{−1},

*t*), and the likelihood surface is smooth. We used the

_{m}*SciPy*(Jones

*et al.*2001) implementation of the Nelder–Mead simplex algorithm (Nelder and Mead 1965) to find the maximum for each time interval.

Our implementation is written in *Python* and *C++* making use of the *Numpy* (Oliphant 2006), *SciPy*, and *mpack* (Nakata 2010) libraries for computations and of the *Matplotlib* library (Hunter 2007) for plotting and is available upon request.

### Simulations

To test our model, we simulate several data sets with the WF model forward in time. Simulating the WF model can be time consuming if the population size is large, so we picked a small population size (*N*_{e} = 500). In principle, however, the conclusions hold for higher population size. We then infer the maximum-likelihood estimates (MLEs) using our one-step method. We use two different sampling schemes. The first one is similar to the real data set we analyze below, *i.e.*, 6 sampling times each with 50 chromosomes. The second one corresponds to having twice as many sampling times with half the number of chromosomes, *i.e.*, 12 sampling times and 25 chromosomes. We searched for the MLEs across a finite domain, *i.e.*, *N*_{e} ε [100, 1000], *t*_{0} ε [−3000, 0], and *γ* ε [−200, 200]. We assess the accuracy of our estimator and compare the sampling schemes by looking at the bias of the estimates and the root mean square error (RMSE).

### Real data

In 2009, Ludwig *et al.* sequenced several loci encoding coat color in horses. Each locus has been shown to be linked with a color phenotype in present day horses. In other words, the phenotype associated with each locus is segregating in present populations. We reanalyze in this article one of the loci encoding for the agouti-signaling protein (ASIP), that controls the distribution of the black pigment (Rieder *et al.* 2001). We investigate the hypothesis that at the beginning of domestication some coat colors in horse were positively selected for.

The samples sequenced were obtained from Siberia, Middle and Eastern Europe, China, and the Iberian Peninsula. As in Ludwig *et al.* (2009), we group the samples into six sampling times, *t*_{1} ≃ 2000, *t*_{2} ≃ 13100, *t*_{3} ≃ 3700, *t*_{4} ≃ 2800, *t*_{5} ≃ 1100, and t_{6} ≃ 500 years BC. We assume that the generation time of horses is 5 years, following Ludwig *et al.* (2009). The wild-type horses are presumed to have been of bay color. The mutation of interest for the *ASIP* locus is recessive, since only horses homozygous for the *ASIP* locus will be black. We test for directional selection and set *h* = 0.

To compute a possible range for the population sizes we use data from Cieslak *et al.* (2010). They sequenced part of the control region of the mtDNA for 78 samples that are part of Ludwig *et al.* (2009)’s data set. The control region of the mtDNA is a noncoding region. One way to compute the population size *N*_{e} is to compute the diversity *π* of the samples. Then, assuming the region is neutral and ignoring hitchhiking effects due to nearby selected sites, we use the relationship that relates the diversity of a sample to the population size, *μ* is the mutation rate per base pair per generation. To obtain an estimate of the mean and standard error of *π* of the mtDNA sample, we use the maximum-likelihood method implemented in *MEGA* (Tamura *et al.* 2011) with default parameters. We approximate the standard error for the diversity by performing 1000 bootstraps. We use Jazin *et al.* (1998)’s estimate for the mutation rate (*i.e.*, *μ* ε (3.0 × 10^{−6}, 4.4 × 10^{−5})). Jazin *et al.* (1998) used human families to obtain direct estimates of the mutation rate for mtDNA control region for a single generation. Although the mutation rate is an important parameter, we do not have direct estimate in horses and we have to rely on results for other species. To obtain conservative upper/lower bounds for *N*_{e} we use the 95% confidence interval (CI) bounds of the mutation rate and the diversity. If the CIs for *μ* and *π* are denoted (*μ*_{low}, *μ*_{up}) and (*π*_{low}, *π*_{up}) respectively, we defined*N*_{e} ε [200, 5000], *t*_{0} ε [−10000, 0], and *γ* ε [−200, 200] for the parameters. We fix *H* = 400 for this computation.

For the CIs, several asymptotic results that apply for maximum likelihood exist, especially for a time-serial Markov chain. Our sample sizes are generally small, however, so we chose to compute the CIs with a parametric bootstrap approach.

Note that several assumptions of our model are violated with this data set, such as constant population size and potentially random mating (since the samples are taken from all around the world); moreover, the *MC*1*R* locus, encoding a melanocortin receptor and related to the black pigment production, is known to have an epistatic interaction with *ASIP* (Rieder *et al.* 2001). Nevertheless, we analyze these data as described above to have a more direct comparison between our results with those obtained with Bollback *et al.*’s method on the same data set.

## Results and Discussion

### Numerics

To validate the method, we compared several known analytical results for the WF model with the one-step process. For the neutral case, it is possible to compute the likelihood since the transition probabilities are known for the diffusion process (see, *e.g.*, Ewens 2004). As can be seen in Figure S3 (see File S1), even for a grid of size 100 the one-step process is a very good approximation of the diffusion process.

We also compare the relative error between the diffusion and the one-step process and demonstrate that when we increase the grid size the one-step process converges toward the diffusion process. The results for a particular choice of parameters is shown in Figure S4 (see File S1). We see that the one-step process does converge as expected with increasing grid size. In general we see that a quadratic grid and an exponential grid perform better than a uniform grid in general (see File S1.3 for details). In the applications below we use a quadratic grid of size between 100 and 400.

### Simulations

We picked a population size of *N*_{e} = 500 and set the allele age to *t*_{0} = −1400. We fix the selection coefficient to seven potential values: *γ* ε {−10, −5, 0, 5, 10, 15, 20}.

First, we fix the sampling times to *T* = (−1000, −800, −600, −400, −200, 0) generations and sample 50 chromosomes at each time point. Then we look at a scheme where the samples are taken every 100 generations from −1100 up to 0 (*i.e.*, 12 samples). At each sampling time we sample 25 chromosomes. The intent is to quantify whether it is better to sample more chromosomes at fewer time points, or the opposite.

The boxplot results for the MLEs for these simulations are shown on Figure 2. They are standard boxplots showing the five-point summary (the minimum, the first quartile, the median, the third quartile, and the maximum). The plots for the bias and the RMSE are shown on Figure S5 (see File S1) for both schemes.

For the population size, the MLEs span all the potential range of *N*_{e} values, but the bulk of the results exclude very low population sizes. This suggests nevertheless that it is hard to estimate *N*_{e} with our method, at least with a precision higher than one order of magnitude. Our estimator is biased upward for both schemes but this might be explained by the presence of outliers since the median is largely accurate. Moreover, the second scheme, with fewer chromosomes and more sampling times, leads to a smaller bias and a smaller RMSE for most cases. Intuitively, we think that most of the information to estimate *s* comes from the general trend of change in allele frequency, while most of the information to estimate *N*_{e} comes from the oscillations around that general trend. In other words, to get a precise estimate of *N*_{e}, we need a dense sampling over time, which is not the case for our simulations that we chose to match the real data setting.

In contrast, the results for the selection coefficient are essentially unbiased, with a symmetric distribution, and the median matching the mean of the distribution. The variance remains large, and only when *γ* is quite high can one reject neutrality. In particular, the higher the selection coefficient, the higher the variance. The RMSE this time is worse for the second sampling scheme.

The results for the allele age also exhibit a large variance. The tail of the distribution is large. This can be explained by the use of the conditional process. Indeed for weak selection, if the number of derived alleles is high at the first sampling time the likelihood becomes uninformative for the allele age [*i.e.*, the likelihood is flat for older allele ages; Figure S3 (see File S1)]. This leads to difficulties for the optimization algorithm to converge to the global maximum. The results seem to be systematically biased upward, although the median is accurate. For strong selection the likelihood is more informative and the estimator is unbiased. Also, for strong selection the scheme with more samples through time performs considerably better. In conclusion, sampling fewer chromosomes over more sampling times will lead to better results especially for strong selection in our examples.

### Real data

The change in allelic frequency of this locus is shown in Figure 3. Although the frequency is increasing in around 3000 generations from 0 to ∼0.8 between the first and the third sampling time, suggesting positive selection, it then drops down to 0.4 in ∼500 generations. It is interesting to note that the archeological evidence for domestication suggests a date of 3500 years BC (Outram *et al.* 2009), which would correspond to the third sampling time (*i.e.*, when the sample frequencies start decreasing).

The first step is to choose a potential range for the population size. We found *π* = 0.024 with a 95% CI of (0.018, 0.030). Together with the 95% CI of the mutation rate, this leads to a range for *N*_{e} of (200, 5000). This is a small population size. It might be explained by the fact that the horses are a domesticated species and most samples are taken after the beginning of domestication, resulting in a small *N*_{e}. On the other hand it might be that the mutation rate calculated for the human population for the control region is not appropriate for horses.

In Figure S6 (see File S1) we plot the likelihood surface for four values of *N*_{e}. This helps us confirm that we have found a global maximum. We note that the higher the population size the higher the selection coefficient and the older the allele age that maximize the likelihood. For example, if the population is fixed at *N*_{e} = 200 then *γ*^{max} = −1.5 and *N*_{e} = 5000, then *γ*^{max} = 9.1 and

Since there is no mutant allele at the first time of sampling, the allele might have arisen after the first sampling time. We denote *dom1* the range between (−∞, −3893] generations, and *dom2* the range (−3893, −2516]. As discussed before, the likelihood is therefore discontinuous as a function of the allele age with discontinuities at the sampling times. It is important to look for the global maximum in *dom1* and *dom2* separately. Moreover, we compute the 95% CI in *dom1* and in *dom2* separately. We build the confidence interval as a union of (potentially) disconnected domains.

The values for the MLEs and 95% CI are shown in Table 1. The first thing to note is that they are compatible with the results of Figure S6 (see File S1). The MLEs were found in *dom2*: *γ*^{mle} ≅ −1.3 with CI [−27.7, 60.7], and

In Figure 4 we plot the distribution for the bootstrap replicates for each parameter and for the maximum-likelihood values. The confidence interval was constructed as the interval between 2.5th and 97.5th percentile. We ran a total of 1400 replicates. For about 30 of those simulations, the optimizer did not converge. Among successful runs, ∼500 did not have an MLE in *dom1* or *dom2* and were discarded. From the remaining, about 823 were found in *dom2* and 34 in *dom1*.

The MLEs and the bootstrap results have several implications. First, we do not find evidence for positive selection as could be anticipated by the archeological evidence for domestication. The discrepancy between this study and Ludwig *et al.* (2009) is first the method used and second the parameter range assumed. Indeed, the results in Ludwig *et al.* (2009) were obtained using Bollback *et al.* (2008)’s method. Since our *dom2*, and Bollback *et al.* (2008) assume that the allele was already present in the first time of sampling, it is to be expected that our results will be very different. Moreover, the potential range for the population size in Ludwig *et al.* (2009) is from 10,000 to 100,000; *i.e.*, it does not overlap with the range for *N*_{e} that we assume here. As noted above, if we had assumed a larger population size, the *γ*^{mle} would be larger.

The distribution of each parameter from the bootstrap replicates are almost unbiased relative to the true value (as could be expected from the results in the *Simulation* sections). The distribution for *γ* resembles a normal distribution while the distribution for *N*_{e} and *t*_{0} are not as simple. For *N*_{e}, the distribution is bimodal with a second mode at the upper bound. This mode is a reflection of the finite domain we impose on the search for the MLE rather than an actual mode. Similarly, for *t*_{0} there is a mode at the lower bound for *dom2*, an artifact of the bounds from the sampling times.

As could be expected from the simulations above, the 95% CI for *N*_{e} suggests that with these data we have little ability to estimate *N*_{e}, which we would expect from a sparse sampling over time as discussed earlier. Similarly, we cannot distinguish between negative and positive selection as *γ*’s CI is between −27.7 and 60.7. On the other hand, the bootstrap replicates suggest that the allele arose in *dom2*. We can indeed test the hypothesis that the allele age is not in *dom2*; that is, we can test the null hypothesis *H*_{0}: *t*_{0} ∉ *dom*2 *vs.* the alternative hypothesis that the allele age is in *dom2*, *H*_{1}: *t*_{0} ε *dom2*. We reject the null hypothesis *H*_{0} with *P*-value

The domain *dom2* corresponds to 20,000 to 13,100 years BC. In other words, from the data, one could have already deduced that the allele had to be present before −13,100 years (*i.e.*, before the presumed start of domestication). Indeed, domestication in horses is thought to have started about 3500 years BC (Outram *et al.* 2009). Our analysis shows that it is likely to have arisen within the last 20,000 years, thus clearly indicating that it was present as a standing variant at the time of domestication.

## Conclusion

The allele age, the strength of selection, and the population size are all crucial parameters in population genetics. Although the volume of molecular data have been growing exponentially in recent years, it often remains a challenge to estimate those key parameters.

We develop a maximum-likelihood approach to estimating these parameters that deals with a particular type of data—temporal data. Our method is based on an approximation to the WF diffusion process and has the advantage of being quite flexible and appropriate for hypothesis testing. Moreover, it is fast for small *γ*, as one evaluation of the likelihood function takes ∼0.1 sec for *γ*

We show through simulations that, for a sample of realistic size, although the variance of our estimator is quite large, our MLE is unbiased for estimating selection and is nearly unbiased for the age of the allele and the effective population size. On the other hand, our method is not appropriate for estimating the population size, even for simulations in which the model used to simulate the data matches the method used to infer the parameters. Indeed, for a realistic sampling scenario, the MLEs for *N*_{e}, although unbiased, can span several orders of magnitude. This is not surprising. The effective population size is a parameter notoriously difficult to estimate, and our method considers only a single locus.

The sampling scheme has of course an impact on the accuracy of the estimator. We investigated two different sampling strategies and concluded that, in the cases considered, it is better to increase the number of sampling times rather than the number of samples per time point. It is indeed intuitive that to be able to estimate the allele age, for the conditional process, it is necessary to have a sample taken at a time close to the allele age. Indeed, in the conditional process, an allele will never get fixed or lost. Thus, after several units of rescaled time, the likelihood is flat.

We reanalyze a locus that was previously found to be under positive selection, *ASIP*, by evaluating samples ranging from the Pleistocene to the present. In this study, we test for directional selection and we do not have sufficient resolution to distinguish positive from negative selection for this locus. This may be due to an insufficient amount of data, but it could also be due to an underestimate of the effective population size, or a violation of one or more assumptions of our null model, as discussed earlier. Although we are not able to estimate the selection coefficient as precisely as we would like, we find the age of the ASIP mutation to range between 20,000 to 13,100 years BC with an MLE at 13,400 years BC, which well predates domestication.

Even though we analyze a mammalian data set, our method can in principle be applied to data sets obtained in experimental evolution or viral data. But, it is important to note that our approximation to the WF model will be valid only when the diffusion approximation to the WF model is appropriate, and hence only when 2*N*_{e}*s* is not too large.

Importantly, our framework readily lends itself to being extended to multiple loci, the topic of future investigation. This extension is anticipated to provide greatly improved estimates of *N*_{e} and to permit the inference of fluctuations in historical population size—both issues of outstanding importance in gaining refined estimates of selection coefficients and allele age.

## Acknowledgments

A.S.M. thanks Fernando Perez for help with the numerical analysis and *Numpy*, *Scipy*, Philip Johnson, and Emilia Huerta-Sanchez for helpful discussion. This work was part of A.S.M.’s Ph.D. thesis in the Department of Integrative Biology, University of California, Berkeley. It was funded in part by an Ernst and Lucie Schmidheiny fellowship to A.S.M., by the French Ministry of Industry (DGCIS) and the Region Ile-de-France in the framework of the LaBS Project, by a National Science Foundation grant DMS-0907630 to S.N.E., and by a National Institutes of Health grant (R01-GM40282) to M.S.

## Footnotes

*Communicating editor: L. M. Wahl*

- Received April 7, 2012.
- Accepted June 18, 2012.

- Copyright © 2012 by the Genetics Society of America