Process of reducing the number of random variables under consideration
For dimensional reduction in physics, see dimensional reduction.
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains meaningful properties of the original data, ideally close to its intrinsic dimension.
Working in high-dimensional spaces can be undesirable for many reasons: raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is therefore common in fields that deal with large numbers of observations and large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.[1]
Methods are commonly divided into linear and nonlinear approaches.[1] Approaches can also be divided into feature selection and feature extraction.[2] Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.
Feature Selection
Feature selection approaches try to find a subset of the input variables (also called features or attributes) that is relevant to the task at hand. The three main strategies are: the filter strategy (e.g., using information gain), the wrapper strategy (e.g., an accuracy-guided search), and the embedded strategy (in which features are added or removed while building the model, based on prediction errors).
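As an illustration of the filter strategy, the sketch below (assuming scikit-learn and NumPy are available; the data are random placeholders) scores each feature by its mutual information with the labels and keeps the top-scoring subset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Placeholder data: 100 samples, 20 candidate features, binary labels.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, size=100)

# Filter strategy: rank features by mutual information with the target
# and keep the five highest-scoring ones.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (100, 5)
```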
Data analysis such as regression or classification can often be carried out more accurately in the reduced space than in the original space.[3]
Feature Projection
Feature projection (also called feature extraction) transforms the data from the high-dimensional space to a space of fewer dimensions. The transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[4][5] For multidimensional data, tensor representations can be used in dimensionality reduction through multilinear subspace learning.[6]
Figure: result of a PCA projection applied to a set of two-dimensional points.
Principal Component Analysis (PCA)
The main linear technique for dimensionality reduction, principal component analysis (PCA), performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance matrix (or sometimes the correlation matrix) of the data is constructed, and its eigenvectors are computed. The eigenvectors corresponding to the largest eigenvalues (the principal components) can then be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system, because they frequently account for the vast majority of the system's energy, especially in low-dimensional systems. Still, this must be proven on a case-by-case basis, as not all systems exhibit this behavior. The original space (with dimension equal to the number of data points) is thus reduced, with some loss of information, to the space spanned by a few eigenvectors.[citation needed]
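A minimal NumPy sketch of this procedure (center the data, form the covariance matrix, take the eigenvectors with the largest eigenvalues, and project) is shown below; the function name and parameter choices are illustrative only.

```python
import numpy as np

def pca(X, n_components):
    # Center the data; PCA works on the covariance of mean-centered data.
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features.
    cov = np.cov(Xc, rowvar=False)
    # Eigendecomposition; eigh is appropriate for the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues (the principal components).
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # Project the centered data onto the principal components.
    return Xc @ components
```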
Non-negative Matrix Factorization (NMF)
Non-negative matrix factorization (NMF) decomposes a non-negative matrix into the product of two non-negative matrices. It has become a promising tool in fields where only non-negative signals exist,[7][8] such as astronomy.[9][10] NMF is well known for the multiplicative update rule of Lee & Seung,[7] which has been continuously developed since: the inclusion of uncertainties,[9] the handling of missing data together with parallel computation,[11] and sequential construction,[11] which contributes to the stability and linearity of NMF,[10] as well as other updates, including the handling of missing data in digital image processing.[12]
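For illustration, a bare-bones version of the multiplicative update rule for the squared-error objective could look as follows; the random initialization, iteration count, and the small constant guarding against division by zero are arbitrary choices in this sketch.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    # Factorize a non-negative matrix X (n x m) as X ≈ W @ H with W, H >= 0.
    n, m = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```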
With a stable component basis during construction and a linear modeling process, sequential NMF[11] is able to preserve the flux in direct imaging of circumstellar structures in astronomy, making it one of the methods used to detect exoplanets, especially for the direct imaging of circumstellar discs. In comparison with PCA, NMF does not remove the mean of the matrices, which preserves physical non-negative fluxes; NMF is therefore able to retain more information than PCA, as demonstrated by Ren et al.[10]
Kernel PCA
Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique is capable of constructing nonlinear mappings that maximize the variance in the data, and is called kernel PCA.
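A compact sketch of kernel PCA with a Gaussian (RBF) kernel is given below, assuming NumPy; the kernel, the value of gamma, and the scaling of the eigenvectors follow one common convention and are meant only to illustrate the idea.

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma=1.0):
    # Gaussian (RBF) kernel matrix from pairwise squared distances.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dists)
    # Center the kernel matrix in feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # The leading eigenvectors of the centered kernel give the nonlinear projection.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```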
Graph-based Kernel PCA
Other prominent nonlinear techniques include manifold learning approaches such as Isomap, locally linear embedding (LLE),[13] Hessian LLE, Laplacian eigenmaps, and methods based on tangent space analysis.[14] These techniques construct a low-dimensional representation of the data using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA.
More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel itself using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space) while maximizing the distances between points that are not nearest neighbors.
An alternative approach to neighborhood preservation is to minimize a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include: classical multidimensional scaling, which is identical to PCA; Isomap, which uses geodesic distances in the data space; diffusion maps, which use diffusion distances in the data space; t-distributed stochastic neighbor embedding (t-SNE), which minimizes the divergence between probability distributions over pairs of points; and curvilinear component analysis.
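As one concrete member of this distance-preserving family, classical multidimensional scaling can be sketched in a few lines of NumPy: double-center the squared distance matrix and take the leading eigenvectors. This is an illustrative sketch, not a reference implementation.

```python
import numpy as np

def classical_mds(D, n_components=2):
    # D is a matrix of pairwise distances between points.
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Coordinates whose pairwise distances approximate the entries of D.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```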
A different approach to nonlinear dimensionality reduction is the use of autoencoders, a special kind of feedforward neural network with a bottleneck hidden layer.[15] The training of deep encoders is typically performed using a greedy layer-wise pre-training (for example, with a stack of restricted Boltzmann machines), followed by a fine-tuning stage based on backpropagation.
Figure: result of an LDA projection applied to a set of two-dimensional points.
Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
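For the two-class case, Fisher's linear discriminant reduces to solving a single linear system, as in the NumPy sketch below; the function name and the unnormalized within-class scatter are illustrative choices.

```python
import numpy as np

def fisher_discriminant(X0, X1):
    # X0 and X1 hold the samples of the two classes (rows are samples).
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of the two classes.
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    # Direction that maximizes between-class separation relative to within-class scatter.
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)
```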
Generalized Discriminant Analysis (GDA)
Generalized discriminant analysis (GDA) deals with nonlinear discriminant analysis using kernel function operators. Its underlying theory is close to that of support-vector machines (SVM), insofar as the GDA method provides a mapping of the input vectors into a high-dimensional feature space.[16][17] Similar to LDA, the objective of GDA is to find a projection of the features into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter.
Autoencoder
Autoencoders can be used to learn nonlinear dimension reduction functions and codings, together with an inverse function that maps the codings back to the original representation.
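A minimal sketch of such an encoder/decoder pair, written here with PyTorch (the framework and layer sizes are assumptions for illustration):

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    # Feedforward network with a low-dimensional bottleneck layer.
    def __init__(self, n_features, n_latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        code = self.encoder(x)        # low-dimensional coding
        return self.decoder(code)     # reconstruction of the input

# Training minimizes reconstruction error, e.g. with nn.MSELoss() and
# torch.optim.Adam(model.parameters()); the encoder output is the reduced representation.
```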
t-SNE
T-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction technique that is well suited for visualizing high-dimensional datasets. Its use for analysis tasks such as clustering or outlier detection is limited, since it does not necessarily preserve densities or distances.[18]
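A typical visualization workflow using the scikit-learn implementation might look like the sketch below; the data are random placeholders and the perplexity value is simply a common choice.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)    # placeholder high-dimensional data
# Embed into two dimensions for plotting; distances and densities in the
# embedding should not be over-interpreted.
X_embedded = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(X_embedded.shape)        # (200, 2)
```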
UMAP
Uniform manifold approximation and projection (UMAP) is another nonlinear dimensionality reduction technique. Visually it is similar to t-SNE, but it assumes that the data are uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.
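Its usage resembles that of t-SNE; a sketch based on the third-party umap-learn package (an assumption about the available tooling), with illustrative parameter values, follows.

```python
import numpy as np
import umap  # provided by the umap-learn package

X = np.random.rand(200, 50)                          # placeholder data
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
X_embedded = reducer.fit_transform(X)                # 2-D embedding
print(X_embedded.shape)                              # (200, 2)
```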
Dimension Reduction
For high-dimensional datasets, dimension reduction is usually performed prior to applying a k-nearest neighbors (k-NN) algorithm, in order to mitigate the curse of dimensionality.[19]
Feature extraction and dimension reduction can be combined in one step by using principal component analysis (PCA), linear discriminant analysis (LDA), canonical correlation analysis (CCA), or non-negative matrix factorization (NMF) as a pre-processing step, followed by clustering via k-NN on feature vectors in the reduced-dimension space. In machine learning, this process is also called low-dimensional embedding.[20]
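One way to express this kind of combined step is a scikit-learn pipeline, sketched below with a k-NN classifier standing in for the k-NN stage and arbitrary component and neighbor counts; X_train, y_train, and X_test are hypothetical arrays.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Project onto the first 10 principal components, then run k-NN
# on the feature vectors in the reduced-dimension space.
model = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=5))
# model.fit(X_train, y_train)
# predictions = model.predict(X_test)
```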
For very-high-dimensional datasets (for example, when performing similarity search on live video streams, DNA data, or high-dimensional time series), running a fast approximate k-NN search using locality-sensitive hashing, random projection,[21] "sketches",[22] or other high-dimensional similarity-search techniques from the VLDB conference toolbox may be the only feasible option.
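As a sketch of one such option, a Gaussian random projection (here via scikit-learn, with an arbitrary target dimension) approximately preserves pairwise distances in the sense of the Johnson–Lindenstrauss lemma.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.rand(500, 10000)        # placeholder very-high-dimensional data
projector = GaussianRandomProjection(n_components=256, random_state=0)
X_small = projector.fit_transform(X)  # pairwise distances are approximately preserved
print(X_small.shape)                  # (500, 256)
```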
Applications
A dimensionality reduction technique that is sometimes used in neuroscience is maximally informative dimensions,[23] which finds a lower-dimensional representation of a dataset such that as much information as possible about the original data is preserved.
Notes:
- ^ a b van der Maaten, Laurens; Postma, Eric; van den Herik, Jaap (October 26, 2009). "Dimensionality Reduction: A Comparative Review" (PDF). J Mach Learn Res. 10: 66–71.
- ^ Pudil, P.; Novovičová, J. (1998). "Novel Methods for Feature Subset Selection with Respect to Problem Knowledge". In Liu, Huan; Motoda, Hiroshi (eds.). Feature Extraction, Construction and Selection. p. 101. doi:10.1007/978-1-4615-5725-8_7. ISBN 978-1-4613-7622-4.
- ^ Rico-Sulayes, Antonio (2017). "Reducing Vector Space Dimensionality in Automatic Classification for Authorship Attribution". Revista Ingeniería Electrónica, Automática y Comunicaciones. 38 (3): 26–35. ISSN 1815-5928.
- ^ Samet, H. (2006) Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. ISBN 0-12-369446-9
- ^ C. Ding, X. He, H. Zha, H.D. Simon, Adaptive Dimension Reduction for Clustering High Dimensional Data, Proceedings of International Conference on Data Mining, 2002
- ^ a b Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). "A Survey of Multilinear Subspace Learning for Tensor Data" (PDF). Pattern Recognition. 44 (7): 1540–1551. Bibcode:2011PatRe..44.1540L. doi:10.1016/j.patcog.2011.01.004.
- ^ a b Daniel D. Lee & H. Sebastian Seung (1999). "Learning the parts of objects by non-negative matrix factorization". Nature. 401 (6755): 788–791. Bibcode:1999Natur.401..788L. doi:10.1038/44565. PMID 10548103. S2CID 4428232.
- ^ Daniel D. Lee & H. Sebastian Seung (2001). Algorithms for Non-negative Matrix Factorization (PDF). Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference. MIT Press. pp. 556–562.
- ^ a b Blanton, Michael R.; Roweis, Sam (2007). "K-corrections and filter transformations in the ultraviolet, optical, and near infrared". The Astronomical Journal. 133 (2): 734–754. arXiv:astro-ph/0606170. Bibcode:2007AJ....133..734B. doi:10.1086/510127. S2CID 18561804.
- ^ a b c Ren, Bin; Pueyo, Laurent; Zhu, Guangtun B.; Duchêne, Gaspard (2018). "Non-negative Matrix Factorization: Robust Extraction of Extended Structures". The Astrophysical Journal. 852 (2): 104. arXiv:1712.10317. Bibcode:2018ApJ...852..104R. doi:10.3847/1538-4357/aaa1f2. S2CID 3966513.
- ^ a b c Zhu, Guangtun B. (2016-12-19). "Nonnegative Matrix Factorization (NMF) with Heteroscedastic Uncertainties and Missing data". arXiv:1612.06037 [astro-ph.IM].
- ^ Ren, Bin; Pueyo, Laurent; Chen, Christine; Choquet, Elodie; Debes, John H.; Duechene, Gaspard; Menard, Francois; Perrin, Marshall D. (2020). "Using Data Imputation for Signal Separation in High Contrast Imaging". The Astrophysical Journal. 892 (2): 74. arXiv:2001.00563. Bibcode:2020ApJ...892...74R. doi:10.3847/1538-4357/ab7024. S2CID 209531731.
- ^ Roweis, S. T.; Saul, L. K. (2000). "Nonlinear Dimensionality Reduction by Locally Linear Embedding". Science. 290 (5500): 2323–2326. Bibcode:2000Sci...290.2323R. CiteSeerX 10.1.1.111.3313. doi:10.1126/science.290.5500.2323. PMID 11125150. S2CID 5987139.
- ^ Zhang, Zhenyue; Zha, Hongyuan (2004). "Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment". SIAM Journal on Scientific Computing. 26 (1): 313–338. Bibcode:2004SJSC...26..313Z. doi:10.1137/s1064827502419154.
- ^ Hongbing Hu, Stephen A. Zahorian (2010). "Dimensionality Reduction Methods for HMM Phonetic Recognition". ICASSP 2010, Dallas, TX.
- ^ Baudat, G.; Anouar, F. (2000). "Generalized Discriminant Analysis Using a Kernel Approach". Neural Computation. 12 (10): 2385–2404. CiteSeerX 10.1.1.412.760. doi:10.1162/089976600300014980. PMID 11032039. S2CID 7036341.
- ^ Haghighat, Mohammad; Zonouz, Saman; Abdel-Mottaleb, Mohamed (2015). "CloudID: Trustworthy cloud-based and cross-enterprise biometric identification". Expert Systems with Applications. 42 (21): 7905–7916. doi:10.1016/j.eswa.2015.06.025.
- ^ Schubert, Erich; Gertz, Michael (2017). "Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection". In Beecks, Christian; Borutta, Felix; Kröger, Peer; Seidl, Thomas (eds.). Similarity Search and Applications. Lecture Notes in Computer Science. Vol. 10609. Cham: Springer International Publishing. pp. 188–203. doi:10.1007/978-3-319-68474-1_13. ISBN 978-3-319-68474-1.
- ^ Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft (1999) "When is "nearest neighbor" meaningful?". Database Theory—ICDT99, 217–235
- ^ Shaw, B.; Jebara, T. (2009). "Structure preserving embedding" (PDF). Proceedings of the 26th Annual International Conference on Machine Learning – ICML '09. p. 1. CiteSeerX 10.1.1.161.451. doi:10.1145/1553374.1553494. ISBN 9781605585161. S2CID 8522279.
- ^ Bingham, E.; Mannila, H. (2001). "Random projection in dimensionality reduction". Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining – KDD '01. p. 245. doi:10.1145/502512.502546. ISBN 978-1581133912. S2CID 1854295.
- ^ Shasha, D. (2004). High Performance Discovery in Time Series. Berlin: Springer. ISBN 0-387-00857-8
- ^ Schütt, Heiko H. (2024-11-13), Bayesian Comparisons Between Representations, arXiv:2411.08739
External links:
- JMLR Special Issue on Variable and Feature Selection
- ELastic MAPs Archived 2011-07-20 at the Wayback Machine
- Locally Linear Embedding
- Visual Comparison of various dimensionality reduction methods
- A Global Geometric Framework for Nonlinear Dimensionality Reduction