Tutorial
Tutorial on maximum likelihood estimation

https://doi.org/10.1016/S0022-2496(02)00028-7

Abstract

In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial is researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation method. Unlike least-squares estimation, which is primarily a descriptive tool, MLE is a preferred method of parameter estimation in statistics and an indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so that the reader can grasp some of its basic principles.

Introduction

In psychological science, we seek to uncover general laws and principles that govern the behavior under investigation. As these laws and principles are not directly observable, they are formulated in terms of hypotheses. In mathematical modeling, such hypotheses about the structure and inner workings of the behavioral process of interest are stated in terms of parametric families of probability distributions called models. The goal of modeling is to deduce the form of the underlying process by testing the viability of such models.

Once a model is specified with its parameters, and data have been collected, one is in a position to evaluate its goodness of fit, that is, how well it fits the observed data. Goodness of fit is assessed by finding the parameter values of the model that best fit the data, a procedure called parameter estimation.

There are two general methods of parameter estimation: least-squares estimation (LSE) and maximum likelihood estimation (MLE). The former has been a popular choice for model fitting in psychology (e.g., Rubin, Hinton, & Wenzel, 1999; Lamberts, 2000; but see Usher & McClelland, 2001) and is tied to many familiar statistical concepts such as linear regression, sum of squares error, proportion of variance accounted for (i.e., r²), and root mean squared deviation. LSE, which unlike MLE requires no or minimal distributional assumptions, is useful for obtaining a descriptive measure that summarizes observed data, but it provides no basis for testing hypotheses or constructing confidence intervals.
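
As a rough illustration (not from the paper), the sketch below computes the LSE quantities just mentioned for a simple linear regression; the data values are made up for the example.

    # Least-squares fit of a line, with sum of squares error (SSE),
    # proportion of variance accounted for (r^2), and root mean squared
    # deviation (RMSD). Data are illustrative, not from the paper.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])         # predictor (made-up values)
    y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])        # observations (made-up values)

    slope, intercept = np.polyfit(x, y, 1)          # least-squares line
    y_hat = slope * x + intercept                   # model predictions

    sse = np.sum((y - y_hat) ** 2)                  # sum of squares error
    r2 = 1 - sse / np.sum((y - np.mean(y)) ** 2)    # proportion of variance accounted for
    rmsd = np.sqrt(sse / len(y))                    # root mean squared deviation

    print(slope, intercept, sse, r2, rmsd)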

On the other hand, MLE is not as widely recognized among modelers in psychology, but it is the standard approach to parameter estimation and inference in statistics. MLE has many optimal properties in estimation: sufficiency (complete information about the parameter of interest is contained in its MLE estimator); consistency (the true parameter value that generated the data is recovered asymptotically, i.e., for samples of sufficiently large size); efficiency (the lowest possible variance of parameter estimates is achieved asymptotically); and parameterization invariance (the same MLE solution is obtained independent of the parameterization used). In contrast, no such guarantees can be made for LSE. As such, most statisticians would not view LSE as a general method of parameter estimation, but rather as an approach primarily used with linear regression models. Further, many inference methods in statistics are built on MLE. For example, MLE is a prerequisite for the chi-square test, the G-square test, Bayesian methods, inference with missing data, modeling of random effects, and many model selection criteria such as the Akaike information criterion (Akaike, 1973) and the Bayesian information criterion (Schwarz, 1978).
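
For concreteness, the two model selection criteria named above are computed directly from a model's maximized log-likelihood. The sketch below uses the standard formulas AIC = -2 ln L + 2k and BIC = -2 ln L + k ln n, where k is the number of free parameters and n the sample size; the numbers in the usage line are hypothetical.

    # Two MLE-based model selection criteria, computed from the maximized
    # log-likelihood of a fitted model.
    import math

    def aic(log_lik, k):
        """Akaike information criterion: -2*lnL + 2*k."""
        return -2.0 * log_lik + 2.0 * k

    def bic(log_lik, k, n):
        """Bayesian information criterion: -2*lnL + k*ln(n)."""
        return -2.0 * log_lik + k * math.log(n)

    # Hypothetical fit: log-likelihood of -120.5, 2 parameters, 100 observations.
    print(aic(-120.5, 2), bic(-120.5, 2, 100))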

In this tutorial paper, I introduce the maximum likelihood estimation method for mathematical modeling. The paper is written for researchers who are primarily involved in empirical work and publish in experimental journals (e.g., Journal of Experimental Psychology) but who also do modeling. It is intended to serve as a stepping stone for the modeler to move beyond the current practice of using LSE to more informed modeling analyses, thereby expanding his or her repertoire of statistical instruments, especially in non-linear modeling. The purpose of the paper is to provide a good conceptual understanding of the method with concrete examples. For an in-depth, technically more rigorous treatment of the topic, the reader is directed to other sources (e.g., Bickel & Doksum, 1977, Chap. 3; Casella & Berger, 2002, Chap. 7; DeGroot & Schervish, 2002, Chap. 6; Spanos, 1999, Chap. 13).

Probability density function

From a statistical standpoint, the data vector y=(y1,…,ym) is a random sample from an unknown population. The goal of data analysis is to identify the population that is most likely to have generated the sample. In statistics, each population is identified by a corresponding probability distribution. Associated with each probability distribution is a unique value of the model's parameter. As the parameter changes in value, different probability distributions are generated. Formally, a model is defined as the family of probability distributions indexed by the model's parameters.
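
The idea that a model is a family of probability distributions indexed by its parameter can be made concrete with a small sketch. The binomial model below (y correct responses out of n = 10 trials, with success probability w) is used here only as an assumed example; each value of w picks out a different distribution over the possible outcomes.

    # A parametric family of distributions: one binomial distribution per
    # value of the parameter w.
    from scipy.stats import binom

    n = 10
    for w in (0.2, 0.5, 0.7):           # three parameter values, three distributions
        pmf = [binom.pmf(y, n, w) for y in range(n + 1)]
        print(f"w = {w}: {[round(p, 3) for p in pmf]}")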

Maximum likelihood estimation

Once data have been collected and the likelihood function of a model given the data is determined, one is in a position to make statistical inferences about the population, that is, the probability distribution that underlies the data. Given that different parameter values index different probability distributions (Fig. 1), we are interested in finding the parameter value that corresponds to the desired probability distribution.

The principle of maximum likelihood estimation (MLE), originally developed by R. A. Fisher in the 1920s, states that the desired probability distribution is the one that makes the observed data most likely; that is, one seeks the parameter value that maximizes the likelihood function.
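
Under the same assumed binomial example, the MLE principle amounts to searching for the value of w that maximizes the log-likelihood of the observed data. The sketch below does this numerically for a hypothetical observation of y = 7 correct responses out of n = 10 trials; for the binomial the analytic answer is w = y/n, which the search should recover.

    # Numerical maximum likelihood estimation for a binomial observation,
    # by minimizing the negative log-likelihood.
    from scipy.optimize import minimize_scalar
    from scipy.stats import binom

    n, y = 10, 7                        # hypothetical data

    def neg_log_likelihood(w):
        return -binom.logpmf(y, n, w)

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(result.x, y / n)              # numerical MLE vs. analytic solution y/n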

Illustrative example

In this section, I present an application example of maximum likelihood estimation. To illustrate the method, I chose forgetting data, given the recent surge of interest in this topic (e.g., Rubin & Wenzel, 1996; Wickens, 1998; Wixted & Ebbesen, 1991).

Among the half-dozen retention functions that have been proposed and tested in the past, I provide an example of MLE for two of them, the power and exponential functions. Let w=(w1,w2) be the parameter vector, t time, and p(w,t) the model's prediction of the probability of correct recall at time t.
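
A minimal sketch of such a fit is given below. It assumes, as is common in retention analyses, that the number of correct recalls at each interval t is binomial with success probability p(w, t), and it fits both a power model p(w, t) = w1 * t^(-w2) and an exponential model p(w, t) = w1 * exp(-w2 * t) by maximizing the log-likelihood. The data points are hypothetical, not the data analyzed in the paper.

    # MLE for power and exponential retention functions, assuming binomial
    # counts of correct recalls at each retention interval.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import binom

    t = np.array([1.0, 3.0, 6.0, 9.0, 12.0, 18.0])   # retention intervals (hypothetical)
    k = np.array([94, 77, 40, 26, 24, 16])           # correct recalls (hypothetical)
    n = 100                                          # trials per interval (hypothetical)

    def neg_log_lik(w, pred):
        p = np.clip(pred(w, t), 1e-10, 1 - 1e-10)    # keep probabilities inside (0, 1)
        return -np.sum(binom.logpmf(k, n, p))

    power = lambda w, t: w[0] * t ** (-w[1])         # p(w, t) = w1 * t^(-w2)
    expon = lambda w, t: w[0] * np.exp(-w[1] * t)    # p(w, t) = w1 * exp(-w2 * t)

    for name, model in [("power", power), ("exponential", expon)]:
        fit = minimize(neg_log_lik, x0=[0.9, 0.2], args=(model,),
                       bounds=[(1e-6, 1.0), (1e-6, 5.0)])
        print(name, fit.x, -fit.fun)                 # MLE of (w1, w2), maximized log-likelihood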

Concluding remarks

This article provides a tutorial exposition of maximum likelihood estimation. Unlike LSE, which is primarily a descriptive tool, MLE is of fundamental importance in the theory of inference and is the basis of many inferential techniques in statistics. In this paper, I provide a simple, intuitive explanation of the method so that the reader can grasp some of its basic principles. I hope the reader will apply the method in his or her mathematical modeling efforts so that the plethora of widely available MLE-based techniques of statistical inference can be put to full use.

Acknowledgements

This work was supported by research Grant R01 MH57472 from the National Institute of Mental Health. The author thanks Mark Pitt, Richard Schweickert, and two anonymous reviewers for valuable comments on earlier versions of this paper.

References (20)

  • Batchelder, W. H., et al. (1997). Multinomial processing tree models of factorial categorization. Journal of Mathematical Psychology.
  • Myung, I. J., et al. (2000). Special issue on model selection. Journal of Mathematical Psychology.
  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: Petrox, B.N., & Caski,...
  • Bickel, P. J., et al. (1977). Mathematical statistics.
  • Casella, G., et al. (2002). Statistical inference.
  • DeGroot, M. H., et al. (2002). Probability and statistics.
  • Kirkpatrick, S., et al. (1983). Optimization by simulated annealing. Science.
  • Lamberts, K. (2000). Information-accumulation theory of speeded categorization. Psychological Review.
  • Linhart, H., et al. (1986). Model selection.
  • Murdock, B. B. (1961). The retention of individual items. Journal of Experimental Psychology.
