An efficient discriminant-based solution for small sample size problem
Introduction
Feature extraction can be defined as a mapping from a typically high-dimensional data space to a space of reduced dimension, while maintaining some key data properties [1, p. 11]. Designing efficient feature extraction techniques that preserve, or even enhance, class separability has been an active research area [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. In general, feature extraction renders data management less cumbersome, while enabling more accurate estimates of data statistics [11], [12]. Modern technologies generate data at an unprecedented rate and scale, and for classifying such large volumes of data, feature extraction becomes ubiquitous. The process is especially challenging under small sample size conditions [13, p. 39], common to many applications such as face recognition [14], [15], object recognition [16], bioinformatics [17], [18], or large-scale biomedical data classification [19], [20], where the dimension of the data, n, by far exceeds the number of samples, N, in the training database. Under small sample size conditions, the data space is extremely sparse [13, p. 70], [21], and consequently the sample statistics may be grossly inaccurate. For the predominantly used second-order techniques, small sample size conditions imply highly singular covariance matrices; hence, a direct implementation of classical discriminant analysis methods like linear discriminant analysis (LDA) [2], [3] is not possible.
A standard technique for coping with the singular covariance matrix problem is covariance matrix shrinkage [22], [23], [24], [25]. While shrinkage-based methods regularize a covariance matrix, they do not address the problem of high-dimensional data. This problem, commonly referred to as the curse of dimensionality, presents challenges in handling and manipulating statistical data. In particular, the inversion and spectral decomposition of large matrices may not be feasible with standard computer architectures. In addition, one or more shrinkage parameters need to be estimated, which typically requires expensive procedures [26], [27].
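As an illustration, one common shrinkage form blends the sample covariance S with a scaled-identity target, (1 − lam) S + lam (tr(S)/n) I. The sketch below assumes a fixed shrinkage parameter lam for simplicity; in practice this parameter would be estimated, e.g. by the procedures of [26], [27].

```python
import numpy as np

def shrink_covariance(X, lam=0.1):
    """Shrink the sample covariance of X (rows are samples) toward a
    scaled-identity target: (1 - lam) * S + lam * (tr(S)/n) * I."""
    S = np.cov(X, rowvar=False)             # n x n sample covariance
    n = S.shape[0]
    target = (np.trace(S) / n) * np.eye(n)  # scaled-identity target
    return (1.0 - lam) * S + lam * target
```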
A popular approach to the small sample size problem is the removal of non-informative features via subspace-based decomposition techniques. Based on the assumption that the sparse portion of the data space carries little or no useful information, this approach proceeds by discarding the irrelevant subspace. Perhaps the most widely used subspace-based approach is to reduce the data dimension via principal component analysis (PCA) [28]. More elaborate subspace decomposition techniques have been developed, especially in the context of face recognition [14], [15], [29], [30], [31], [32]. Motivated by the success of PCA as a face representation technique [33], Turk and Pentland [14] proposed a face recognition technique widely referred to as the eigenface method. Upon removal of the extraneous subspace, this technique facilitates recognition by calculating the distance of a test face from the known face classes in the feature space. The related Fisherface method [15] uses PCA on the input data to make the within-class-scatter matrix regular, followed by LDA on the reduced-dimension data. The coupling of LDA to PCA in the Fisherface method improves its face recognition performance with respect to the eigenface method. Several other LDA-based approaches addressing the small sample size problem have been developed recently [29], [30], [31], [32], [34], [35], [36]. Yu and Yang in Ref. [30] proposed a direct linear discriminant analysis (DLDA) method, which discards the null space of the between-class-scatter matrix under the assumption that this space contains no discriminative information. This assumption was subsequently disproved in Refs. [31], [37]. Wang and Tang in Ref. [31] presented a dual-space LDA, which combines two complementary LDA subspaces at the feature level; their approach has similarities with the idea of regularization by shrinkage. Huang et al. in Ref. [34] suggested removing the null space of the total-scatter matrix, followed by computing the principal components of the between-class-scatter matrix restricted to the null space of the within-class-scatter matrix. Recently, Zhang and Sim [36] used the Fukunaga–Koontz transform [38] to analyze the discriminant subspace and suggested an alternative subspace decomposition technique for LDA.
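A Fisherface-style pipeline can be sketched as PCA followed by LDA. The sketch below uses scikit-learn and caps the number of principal components at N − c (samples minus classes) so that the within-class-scatter matrix becomes regular; this cap is a standard but here assumed choice, not a prescription from Ref. [15].

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

def fisherface_style_model(X, y):
    """PCA reduces the data so the within-class scatter becomes
    nonsingular; LDA is then applied in the reduced space."""
    N, n = X.shape
    c = len(np.unique(y))
    n_components = min(N - c, n)   # keep at most N - c components
    return make_pipeline(PCA(n_components=n_components),
                         LinearDiscriminantAnalysis()).fit(X, y)
```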
An arsenal of kernel-based feature extraction techniques, such as kernel principal component analysis (KPCA) [39] and several variants of kernel Fisher discriminant (KFD) [40], [41], [42], [43], [44], [45], has gained attention recently. Inspired by the success of nonlinear dimensionality reduction techniques [46], [47], an important class of manifold-based learning methods [48], [49], [50], [51], [52], [53] has also been proposed. A common assumption in these methods is that high-dimensional data lie on a low-dimensional submanifold, which is then linearly “flattened” into a subspace. The flattening is typically performed by preserving the local data structure, usually through a nearest-neighbor construction. For example, He et al. [48], [54] developed the so-called locality preserving projection (LPP) approach, where the data submanifold is uncovered by constructing a weighted graph of local interactions. They also showed that PCA and LDA can be derived within this framework by an appropriate choice of graph weights. Fu and Huang's method [49], termed locally embedded analysis (LEA), represents a simplification of the method in Ref. [46], in that nonlinear transformations are replaced by linear projections. The method of Chen et al. [50] is similar to the LPP approach, except that class labels are explicitly utilized in the manifold learning. Further refinement and unification of the above ideas within a graph-theoretic framework were presented in Ref. [51], which also introduced the so-called marginal Fisher analysis (MFA). A related technique was developed in Ref. [52], which essentially combines the above graph-theoretic framework with LEA. A feature common to all of these techniques is that they result in a single (linear) subspace found by solving a generalized eigenvalue/eigenvector problem, similar to the classical LDA method. Therefore, under small sample size conditions, these methods face the same matrix singularity problems as LDA, and a typical solution is to apply PCA to remove the excess dimension and render the matrices at hand regular [48], [51], [52]. Many of these methods also come with kernel versions, e.g. Refs. [50], [51], which are suitable for problems where linear projections seem inadequate. A related extension of these ideas is based on tensor analysis, where the structure in high-dimensional data is preserved by treating samples as tensors, rather than vectors. By applying the rules of multilinear algebra, the generalization of vector spaces and submanifolds to tensor spaces and submanifolds follows readily. In particular, tensor versions of PCA, LDA, LPP and MFA have been developed in Refs. [51], [55], [56], [57], respectively. While conceived as general dimensionality reduction/feature extraction techniques, the above methods have been mainly applied and tested on face and object recognition problems.
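The shared computational core of these methods is a generalized eigenvalue problem of the form A w = lam B w, where A and B are method-specific scatter or graph matrices. A minimal sketch of this step, assuming symmetric A and positive definite B (e.g. after the PCA preprocessing mentioned above), is:

```python
import numpy as np
from scipy.linalg import eigh

def graph_embedding_directions(A, B, d):
    """Solve A w = lam * B w for symmetric A and symmetric positive
    definite B, and return the d directions with largest eigenvalues,
    as in LDA-, LPP- and MFA-type formulations."""
    evals, evecs = eigh(A, B)      # eigenvalues in ascending order
    return evecs[:, ::-1][:, :d]   # top-d generalized eigenvectors
```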
To overcome difficulties imposed by singular covariance matrices and/or the curse of dimensionality, popular pattern recognition approaches rely on either global dimensionality reduction techniques, such as PCA [14], or the discarding of some specific subspace, typically within the LDA framework [29], [30], [31], [32], [34], [36]. Since PCA utilizes the statistical properties of a common data distribution while ignoring class-conditional statistics, it can be argued that PCA and other global dimensionality reduction techniques are suboptimal for classification purposes. On the other hand, LDA-based subspace decomposition techniques, as well as the manifold learning techniques described in Section 1.1, are well suited to pattern recognition problems, but have not been tested outside of their primary application domain (face recognition). Our preliminary analysis [58], [59] shows that some of these techniques perform poorly when tested in low signal-to-noise ratio (SNR) conditions, characteristic of bioinformatics and biomedical measurements. Additionally, some of the methods, such as those of Refs. [48], [49], [50], [51], [53], [57], are computationally expensive when applied to large-scale biomedical data analysis.
In bioinformatics, the small sample size problem is quite common; for example, a typical microarray study consists of up to a hundred samples with thousands of gene-expression measurements. Reduction of data dimension also becomes essential for tasks like biomarker discovery [18] and interpretive analysis [17]. Classification of brain signals, such as electroencephalograms (EEG), electrocorticograms (ECoG) [20], [58], [59] and functional magnetic resonance images (fMRI) [60], faces similar feature extraction challenges. Usual solutions to the small sample size problem in these applications are mostly heuristic and include data binning [19], [20], ranking and scoring of features based on some criterion [17], [18], [19] (e.g. the Fisher score), as well as the selection of ad hoc features [61], [62] (e.g. the power spectrum). In the case of high-dimensional spatio-temporal signals, a space–time separability assumption is often invoked; the data are first processed spatially (e.g. Laplacian filtering), followed by temporal processing (e.g. autoregressive frequency analysis) [61], [62].
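For reference, one common definition of the Fisher score mentioned above computes, per feature, the ratio of between-class variance of the class means to the pooled within-class variance; the exact normalization varies across the cited studies, so the sketch below is illustrative rather than canonical.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score: between-class scatter of class means
    over the pooled within-class scatter (any number of classes)."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += Xc.shape[0] * (Xc.mean(axis=0) - mu) ** 2
        den += Xc.shape[0] * Xc.var(axis=0)
    return num / (den + 1e-12)   # epsilon guards zero-variance features
```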
In this article a computationally efficient, locally adaptable feature extraction and classification technique is developed, suitable for statistical data under small sample size conditions. The proposed method exploits the strength of PCA as a dimensionality reduction technique, while preserving the class-specific information needed to facilitate subsequent classification. The technique is based on a classwise principal component analysis (CPCA), resulting in a simple piecewise linear dimensionality reduction. The idea behind the proposed approach is simple: a useless (non-informative) subspace in the data is identified and discarded by applying PCA to each class. Classification is then carried out in the residual space, where the small sample size conditions and the curse of dimensionality are no longer concerns. Fig. 1 illustrates the advantage of piecewise linear feature extraction over single-subspace techniques such as PCA and LDA. Monte Carlo integration confirms this assertion: for both examples in Fig. 1, the classification error estimated with the Bayes classifier in the CPCA subspaces is lower than that in the PCA and LDA subspaces (Figs. 1A and 1B).
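To make the idea concrete, the following sketch shows one simplified reading of the CPCA scheme: one principal subspace is fitted per class, and a test sample is assigned by a maximum a posteriori rule using a diagonal-Gaussian model in each class subspace. The subspace dimension m and the diagonal-Gaussian assumption are illustrative simplifications; the full procedure is specified in Section 2.

```python
import numpy as np

class CPCAClassifier:
    """Classwise PCA followed by a MAP rule: one principal subspace is
    fitted per class, and a diagonal Gaussian is evaluated in each
    class's own subspace (a simplified, illustrative reading)."""

    def fit(self, X, y, m=2):
        # m is assumed to satisfy m <= min(class size - 1, data dimension).
        self.classes_ = np.unique(y)
        self.models_, self.priors_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]                      # class-c training samples
            mu = Xc.mean(axis=0)
            _, s, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
            W = Vt[:m].T                        # n x m class subspace basis
            var = s[:m] ** 2 / max(Xc.shape[0] - 1, 1) + 1e-12
            self.models_[c] = (mu, W, var)
            self.priors_[c] = Xc.shape[0] / X.shape[0]
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mu, W, var = self.models_[c]
            Z = (X - mu) @ W                    # coordinates in class subspace
            loglik = -0.5 * np.sum(Z ** 2 / var + np.log(2 * np.pi * var), axis=1)
            scores.append(loglik + np.log(self.priors_[c]))
        return self.classes_[np.argmax(np.stack(scores, axis=1), axis=1)]
```

A call such as CPCAClassifier().fit(X_train, y_train, m=3).predict(X_test) then runs the complete extract-and-classify loop.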
There have been several attempts to develop a CPCA-based classifier. For example, CPCA has been explored as a face recognition technique in Refs. [63], [64], although these methods represent a mere extension of the eigenface technique [33]. Namely, the test data are projected onto the classwise principal component subspaces and associated with the class whose subspace provides the best data approximation. Apart from using CPCA, this “classification by approximation” approach is otherwise ignorant of class-discriminant information. Liu et al. [65] developed a CPCA system for unsupervised detection and classification of stars and galaxies in astronomical images. Because of its unsupervised nature, the method relies on model selection within the Gaussian mixture framework. The classwise principal components are learned through a neural network, and the features from multiple subspaces are classified with a back-propagation neural network. Min et al. [66] developed a local PCA embedding technique, suitable for both image representation and classification. The local PCA, however, is performed over neighborhoods (rather than classes), and the method ultimately results in a single (linear) subspace. The method proposed in this article, by contrast, is piecewise linear and results in multiple subspaces. To the best of the authors' knowledge, the proposed method is original and has not been published previously.1
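The “classification by approximation” scheme of Refs. [63], [64] described above reduces to a few lines: the test sample is assigned to the class whose principal subspace yields the smallest reconstruction error. A sketch, with hypothetical per-class means and orthonormal bases as inputs, is:

```python
import numpy as np

def classify_by_approximation(x, subspaces):
    """subspaces: dict mapping class label -> (mean mu, basis W with
    orthonormal columns). Returns the label whose subspace gives the
    smallest reconstruction error for x."""
    errors = {}
    for c, (mu, W) in subspaces.items():
        d = x - mu
        errors[c] = np.linalg.norm(d - W @ (W.T @ d))
    return min(errors, key=errors.get)
```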
The article is organized as follows. In Section 2, the CPCA method is introduced by describing the process of piecewise linear subspace estimation (Section 2.1), followed by a detailed description of the feature extraction and classification technique (Section 2.2). In Section 3 the performance of the proposed technique is validated on real-world datasets. Discussion and concluding remarks are given in Section 4.
Classwise PCA
The basic principle behind CPCA will be illustrated on a simple two-class example (see Fig. 2). In this example, two-dimensional (n = 2) data are principally confined to a one-dimensional (m = 1) manifold, and so the class-specific principal component subspaces, S1 and S2, can be seen as a piecewise linear approximation of the data manifold. Both classes are then projected to S1 and S2, together with an unlabeled test datum, x. Within each subspace, statistical tests can be performed to determine
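The piecewise linear character of this approximation can be checked numerically. In the hypothetical example below (synthetic data on a parabolic manifold, not the data of Fig. 2), the first principal direction of each class approximates the local tangent of the manifold on that class's segment:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes on different segments of a curved 1-D manifold (n = 2).
t1, t2 = rng.uniform(0, 1, 100), rng.uniform(1, 2, 100)
X1 = np.c_[t1, t1 ** 2] + 0.02 * rng.standard_normal((100, 2))
X2 = np.c_[t2, t2 ** 2] + 0.02 * rng.standard_normal((100, 2))

for name, Xc in (("S1", X1), ("S2", X2)):
    _, _, Vt = np.linalg.svd(Xc - Xc.mean(axis=0), full_matrices=False)
    print(name, "direction:", Vt[0])  # first PC tracks the local tangent
```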
Description of experiments
The proposed feature extraction/classification technique was applied to various datasets whose properties are summarized in Table 1. Five datasets [(c)–(g)] present genuine small sample size problems (N ≪ n), whereas datasets (a) and (b) yield singular covariance matrices and are not directly amenable to standard second-order techniques, such as [2], [3], [10], [11], [12], [70], [71].
The datasets (a) and (b) are taken from the UCI Machine Learning Repository [72]. The
Discussion
The superior performance of the method may be attributed to its ability to extract features in a nonlinear (piecewise linear) fashion, while retaining the computational simplicity characteristic of eigenvalue-based techniques. The competing techniques, whether LDA- or manifold-based, are linear and consequently result in a single feature extraction subspace. This feature of the proposed technique may be especially advantageous in dealing with classes that are highly overlapped and where
Conclusion
A novel piecewise linear dimensionality reduction technique has been developed, suitable for classification of high-dimensional data under the small sample size conditions. At the core of the technique is a classwise (local) principal component analysis, allied to linear discriminant tools. The technique yields as many linear feature subspaces as the number of classes. An unlabeled sample is then classified according to the maximum a posteriori probability rule, and assigned to the class with
References (78)
- et al., Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Commun. (1998)
- et al., A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition (2000)
- et al., A direct LDA algorithm for high-dimensional data—with application to face recognition, Pattern Recognition Lett. (2001)
- et al., Why direct LDA is not equivalent to LDA, Pattern Recognition (2006)
- et al., EEG-based discrimination between imagination of right and left hand movement, Electroen. Clin. Neuro. (1997)
- et al., Improved system for object detection and star/galaxy classification via local subspace analysis, Neural Networks (2003)
- et al., Locality pursuit embedding, Pattern Recognition (2004)
- et al., Linear dimension reduction and Bayes classification with unknown population parameters, Pattern Recognition (1982)
- et al., Feature reduction for classification of multidimensional data, Pattern Recognition (2000)
- et al., Use of proteomic patterns in serum to identify ovarian cancer, Lancet (2002)
- Pattern Classification
- The use of multiple measurements in taxonomic problems, Ann. Eugenics
- The utilization of multiple measurements in problems of biological classification, J. R. Stat. Soc. B Methods
- The characteristic selection problem in recognition systems, IEEE Trans. Inf. Theory
- Pattern Recognition: A Statistical Approach
- Discriminative features for document classification
- Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion, IEEE Trans. Pattern Anal.
- Information discriminant analysis: feature extraction with an information theoretic objective, IEEE Trans. Pattern Anal.
- Approximate information discriminant analysis: a computationally simple heteroscedastic feature extraction technique, Pattern Recognition
- Introduction to Statistical Pattern Recognition
- Eigenfaces for recognition, J. Cognitive Neurosci.
- Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal.
- Support vector machines for 3D object recognition, IEEE Trans. Pattern Anal.
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science
- Significance analysis of microarrays applied to the ionizing radiation response, Proc. Natl. Acad. Sci. USA
- Spatial selectivity in human ventrolateral prefrontal cortex, Nat. Neurosci.
- Advances in cognitive neural prosthesis: recognition of neural data with an information-theoretic objective
- Projection pursuit, Ann. Stat.
- A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics, Stat. Appl. Genet. Mol. Biol.
- Shrinkage estimators for covariance matrices, Biometrics
- The Elements of Statistical Learning
- Covariance matrix estimation and classification with limited training data, IEEE Trans. Pattern Anal.
- Principal Component Analysis
- Dual-space linear discriminant analysis for face recognition, in: 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04)
- Subclass discriminant analysis, IEEE Trans. Pattern Anal.
About the Author—KOEL DAS received her M.S. in electrical engineering from Wright State University, OH, 2003 and Ph.D. in electrical engineering and computer science from University of California, Irvine, CA, 2007. She is currently working as a research scholar at University of California, Santa Barbara. Her research interests are in the area of brain–machine interfaces, pattern recognition, image processing and geometric processing.
About the Author—ZORAN NENADIC received the diploma in control engineering from the University of Belgrade, Serbia, in 1995 and the M.S. and D.Sc. degrees in systems science and mathematics from Washington University, St. Louis, Missouri, in 1998 and 2001, respectively. From 2001 to 2005, he was a postdoctoral fellow with the Division of Engineering and Applied Science at the California Institute of Technology, Pasadena. Since 2005, he has been with the Department of Biomedical Engineering, University of California, Irvine, where he is currently an assistant professor. His research interests are in the area of adaptive biomedical signal processing, control algorithms for biomedical devices, brain–machine interfaces, and modeling and analysis of biological neural networks. He is a member of the IEEE, the Mathematical Association of America, and the Society for Neuroscience.
1 Present address: University of California, Santa Barbara, CA 92697, USA.