7 High-dimensional data
Introduction
High-dimensional data refers to data where the number of variables is greater than the number of observations. This kind of data started appearing in the statistical literature when high-throughput genetic experiments started being conducted in collaborative labs. These initially involved DNA microarrays and subsequently included various kinds of omics data.
The challenge that high-dimensional data poses to statistical estimation is that almost all statistical methods are predicated on the assumption that the number of observations is greater than the number of predictors. This requirement, often expressed as \(n>p\), ensures that the parameters of various statistical models are identifiable and can be estimated uniquely. For example, in the linear model, this assumption allows us to invert \(X^\top X\) in order to solve the usual normal equations and obtain identifiable estimates of the weight parameters.
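Concretely, for the linear model \(y = X\beta + \varepsilon\), the normal equations and their solution are
\[
X^\top X \hat{\beta} = X^\top y \quad\Longrightarrow\quad \hat{\beta} = (X^\top X)^{-1} X^\top y,
\]
which requires the \(p \times p\) matrix \(X^\top X\) to be invertible. When \(n < p\), the rank of \(X^\top X\) is at most \(n < p\), so it is singular and \(\hat{\beta}\) is not uniquely determined.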
There are two general areas in which research in high-dimensional data has concentrated: feature selection and multiple testing. There has also been interest in dimension reduction techniques, so that a tractable number of variables can be interrogated rather than the full and large set of variables on which data are collected. In this section, we will survey each of these areas.
Issues
Before we delve into methods for high-dimensional data, let us first discuss some of the issues that high-dimensional data poses. One issue already identified is computational tractability. The underlying issue here is information content. Think about the observation matrix \(X_{n\times p}\). Recall from linear algebra that the rank of a matrix is the number of independent rows or independent columns in that matrix. Since statistics usually organizes data so that the variables are columns, what we typically desire is that the variables we observe are independent, or at the very least unrelated, so that we cannot derive the values of one column from a set of other columns. If we could derive those values, we would not actually need that column, because its information is contained in the other columns.

For a data matrix, therefore, we would like the matrix to have full column rank, that is, the rank of the matrix \(X\) should be \(p\). We also know, as a fact, that the rank of a matrix cannot be larger than either the number of rows or the number of columns in the matrix. If \(n < p\), the maximum rank of the data matrix is \(n\), and so, mathematically, we cannot have a complete set of independent columns. In this situation, the information content is determined by the number of observations, not the number of variables.
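As a quick numerical illustration, here is a minimal sketch using NumPy (the matrix is simulated, and the sizes \(n = 20\), \(p = 100\) are arbitrary choices for this example):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 100                  # more variables than observations
    X = rng.normal(size=(n, p))     # simulated observation matrix

    # The rank is bounded by min(n, p); here that bound is n = 20,
    # so the 100 columns cannot all be linearly independent.
    print(np.linalg.matrix_rank(X))        # 20

    # X^T X is p x p but also has rank at most n, hence it is singular
    # and the normal equations have no unique solution.
    print(np.linalg.matrix_rank(X.T @ X))  # 20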
Dimension reduction techniques
Dimension reduction techniques refer to methods by which we can reduce the corpus of data to a small set of variables that capture the information contained in the larger set. The challenge here is to not lose too much information in the process. Some examples of these techniques include principal components analysis, independent components analysis, t-SNE, UMAP and, in more recent times, variational autoencoders.
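As an illustration of the first of these, here is a minimal sketch of principal components analysis using scikit-learn (the data are simulated, and the choice of 5 components is an arbitrary assumption for this example):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n, p = 50, 500                  # high-dimensional: p > n
    X = rng.normal(size=(n, p))     # simulated data matrix

    # Project the p variables onto a few orthogonal components that
    # capture the directions of greatest variance in the data.
    pca = PCA(n_components=5)
    scores = pca.fit_transform(X)   # n x 5 matrix of component scores

    print(scores.shape)                   # (50, 5)
    print(pca.explained_variance_ratio_)  # variance captured per component

The explained variance ratio is one way to judge whether the reduction has lost too much information: if a handful of components account for most of the variance, downstream analyses can interrogate those components instead of all \(p\) variables.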