Table II

Glossary of useful statistical and machine-learning terms

1. Generative Model—A probabilistic model in which observed samples can be viewed as having been generated by a sequence of events described by the model.
2. Hidden Markov Model (HMM)—A type of graphical model (see below) used to model sequence data (e.g. amino acid sequences, continuous MS signal). The data is modeled as though it were generated by traversing through a sequence of “hidden” states (hidden because they are unknown to the user). In each hidden state visited, a “symbol” (e.g., an amino acid) is emitted. Efficient algorithms have been developed for HMMs that allow computations to be performed quickly.
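As an illustration of this generative view, a minimal sketch (assuming NumPy, with made-up transition and emission probabilities) of how a short symbol sequence could be produced by traversing hidden states:

```python
import numpy as np

# Hypothetical two-state HMM that emits one of three symbols in each state.
rng = np.random.default_rng(0)

start_prob = np.array([0.6, 0.4])                 # P(first hidden state)
trans_prob = np.array([[0.7, 0.3],                # P(next state | current state)
                       [0.2, 0.8]])
emit_prob = np.array([[0.5, 0.4, 0.1],            # P(symbol | state)
                      [0.1, 0.3, 0.6]])
symbols = ["A", "B", "C"]

state = rng.choice(2, p=start_prob)
sequence = []
for _ in range(10):
    sequence.append(symbols[rng.choice(3, p=emit_prob[state])])
    state = rng.choice(2, p=trans_prob[state])

print("".join(sequence))  # one sequence "generated" by the hidden-state traversal
```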
3. Graphical Model—A formalism used to efficiently model joint probabilities (e.g., the joint probability over all measured protein levels, cancer aggressiveness, age, and sex). Graphical models can be used for probabilistic supervised or unsupervised machine learning, and naturally accommodate missing values, as well as latent variables.
4. Latent Variable—A variable in a model which is never directly observed. For example, the cluster number in a clustering algorithm can be viewed as a latent variable. Though we never observe its value, positing its existence makes it easier to make sense of the data. Its value is inferred (usually probabilistically) from the observed data.
5. Learning—The stage of machine learning in which parameters of a model class are fit to the data. Also called training. Learning can be supervised or unsupervised.
6. Supervised Learning—Learning in the case where class labels are known (e.g., which samples belong to cancerous versus healthy people). For example, training a discriminative classifier uses supervised learning.
7. Unsupervised Learning—Learning in the case where class labels are unknown. For example, clustering is unsupervised because no class labels are provided.
8. Classifier—A model of data whose intended use is to make class predictions for unseen data. Sometimes called a discriminant model (because it discriminates between two classes).
9. Generalization—The ability of a learned model to perform well on new, unseen data.
10. Linear Discriminant Analysis (LDA)—A type of classifier in which feature values are linearly combined to obtain a class prediction.
11. Linearly Separable—A two-class data set is said to be linearly separable if, in the input space, a line (in two dimensions), a plane (in three dimensions), or, in higher dimensions, a “hyper-plane” can perfectly separate the two classes of data.
12. Support Vector Machine (SVM)—A class of models that extend the notion of simple linear classifiers to more complex classifiers by projecting the input data into a user-selected, higher-dimensional space (the space is determined by the choice of “kernel”). Even if the data is not linearly separable in the original input space, it may be separable in the kernel space. SVMs are said to have good generalization bounds because of the principle of “margin maximization” at the core of their theoretical development; this principle states that of all the linear classifiers that can separate the input data, one should choose the one that lies furthest from all of the training points.
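A minimal sketch, assuming scikit-learn is available, using synthetic two-class data that is not linearly separable in the input space (one class surrounded by a ring of the other); a linear kernel struggles while an RBF kernel separates the classes in its induced space:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
inner = rng.normal(scale=0.5, size=(100, 2))                       # class 0: central blob
angles = rng.uniform(0, 2 * np.pi, 100)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] \
        + rng.normal(scale=0.3, size=(100, 2))                     # class 1: surrounding ring
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

linear_svm = SVC(kernel="linear").fit(X, y)   # separating hyper-plane in the input space
rbf_svm = SVC(kernel="rbf").fit(X, y)         # hyper-plane in a kernel-induced space

print("linear kernel training accuracy:", linear_svm.score(X, y))
print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))
```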
13. Decision Tree—A classifier in which a new sample is classified by making a series of (usually binary) decisions, which correspond to a traversal through the nodes of the learned tree. The decision made by the tree amounts to following one of a fixed set of rules, such as: If sex = female AND osteopontin_level = high AND noFamilyHistory THEN Class = 1 (healthy), where each part of the rule corresponds to a node in the tree.
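The rule above, written as code (illustrative only; a real tree is learned from data and contains many such paths):

```python
# One path through a hypothetical learned tree; each condition is one node.
def classify(sex, osteopontin_level, family_history):
    if sex == "female" and osteopontin_level == "high" and not family_history:
        return 1  # healthy
    return 0      # the other paths of the tree would have their own outcomes

print(classify("female", "high", family_history=False))  # -> 1
```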
14. Decision Stump—A decision tree with only one node (i.e., a single decision).
15. Random Forest (RF)—A classifier consisting of many decision trees, each one trained with its own subset of the original data, chosen by sampling from the whole training set (with replacement). Classification is performed by having the trees vote on a new sample.
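A minimal sketch of the resample-and-vote idea, assuming scikit-learn decision trees and two classes coded 0/1 (a full random forest additionally considers a random subset of features at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.array([t.predict(X) for t in trees])  # one row of votes per tree
    return (votes.mean(axis=0) > 0.5).astype(int)    # majority vote over the trees
```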
16. K-Nearest Neighbor—One of the simplest possible classifiers in which the model consists of “memorizing” all training cases, and then predicting the class of a new sample to be the majority class of its k nearest neighbors.
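A minimal sketch, assuming NumPy and classes coded as small nonnegative integers:

```python
import numpy as np

# "Memorize" the training set, then let the k closest training samples vote.
def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest
    return np.bincount(y_train[nearest]).argmax()         # majority class among them
```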
17. Receiver-Operating Characteristics (ROC) Curve—A plot used to measure the sensitivity/specificity trade-offs for a particular classifier. The area under the ROC curve is a measure of how well a classifier will perform over all possible sensitivities (or specificities).
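A small sketch, assuming NumPy and a classifier that outputs a score per sample (higher meaning more likely diseased); sweeping a decision threshold over the scores traces out the curve, and the area under it is approximated numerically:

```python
import numpy as np

def roc_curve_points(scores, labels):
    points = []
    for threshold in np.sort(scores)[::-1]:      # sweep thresholds from high to low
        predicted = scores >= threshold
        tpr = np.mean(predicted[labels == 1])    # sensitivity
        fpr = np.mean(predicted[labels == 0])    # 1 - specificity
        points.append((fpr, tpr))
    return np.array(points)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])   # toy scores
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])     # toy true classes
pts = roc_curve_points(scores, labels)
auc = np.trapz(pts[:, 1], pts[:, 0])             # area under the ROC curve
print(round(auc, 2))                             # ~0.81 for this toy example
```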
18. Boosting—A technique in which multiple instances (different parameter settings) of a simple classification model (e.g. a decision stump) are used together to provide a much stronger model. Each model instance is weighted by an amount related to how well it performed on a weighted subset of the training data. The data weights are in turn related to how well the previous combined set of models performed collectively on the data.
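A compact sketch of one well-known variant (discrete AdaBoost) with decision stumps, assuming NumPy and scikit-learn, and class labels coded as -1/+1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    weights = np.full(len(y), 1.0 / len(y))       # start with uniform data weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)    # stump fit to re-weighted data
        pred = stump.predict(X)
        err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # this stump's weight in the ensemble
        weights *= np.exp(-alpha * y * pred)      # up-weight misclassified samples
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)                        # weighted vote of all stumps
```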
19. Monte Carlo Methods—A general class of techniques whereby exact solutions are approximated by using randomly generated samples.
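A classic toy example of the idea, assuming NumPy: approximate pi by sampling random points in the unit square and measuring the fraction that fall inside the quarter circle.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(size=(100_000, 2))
inside = (points ** 2).sum(axis=1) <= 1.0
print(4 * inside.mean())   # close to 3.14159; the approximation improves with more samples
```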
20. Dynamic Programming (DP)—A class of algorithms that solve certain optimization problems by storing the solutions to subproblems of the overall problem and then combining those solutions incrementally into a solution to the overall problem.
21. Expectation-Maximization (EM)—An algorithm used to perform numerical optimization in the context of latent variable models. One iterates between estimating the values of the latent variables (in the E-Step) and fitting the model parameters (in the M-Step), until convergence.
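A sketch, assuming NumPy, of EM for a simple latent-variable model (a two-component one-dimensional Gaussian mixture); the latent variable is which component generated each observation:

```python
import numpy as np

def em_gaussian_mixture(x, n_iter=100):
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])            # crude initial component means
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])                    # mixing proportions
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component
        dens = np.array([pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
                         for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-fit the parameters using those soft assignments
        for k in range(2):
            w = resp[k]
            mu[k] = np.sum(w * x) / w.sum()
            sigma[k] = np.sqrt(np.sum(w * (x - mu[k]) ** 2) / w.sum())
            pi[k] = w.mean()
    return mu, sigma, pi
```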
22. Dynamic Time Warping (DTW)—An algorithm used to align one sequence to another, using DP.
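A minimal sketch, assuming NumPy, that also illustrates DP: cell (i, j) of the table stores the cost of the best alignment of the first i points of one sequence with the first j points of the other.

```python
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch: repeat a point of `a`
                                 cost[i, j - 1],      # stretch: repeat a point of `b`
                                 cost[i - 1, j - 1])  # match the two points
    return cost[n, m]

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))   # 0.0: same shape, shifted timing
```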
23. Overfitting—The phenomenon in which a model with too many free parameters relative to the amount of data ends up fitting (after training) not only the true signal, but also noise and spurious correlations in the data. A model that has overfit will not make good predictions on new, unseen data (i.e., it will not generalize well).
24. Regularization—A technique whereby constraints are added to the objective function so that the effective number of free parameters is reduced, and thus the capacity for overfitting is likewise reduced. For example, one of the simplest methods would be to force some of the free parameters to have the same value.
25. Objective Function—A function set up for training, and which, during training, is maximized (or minimized, depending on the type of function). For example, a typical objective function might be the amount of error produced by a classifier on the training examples (data samples used during training), which would be minimized.
26. Shrinkage (e.g. Nearest Shrunken Centroid, NSC)—A type of regularization whereby class-specific centroids are “shrunk” toward the overall (nonclass-specific) centroid. This has the effect of eliminating the influence of the most weakly correlated features, thereby reducing the capacity to overfit.
27. Lasso—A regularization method whereby the sum of the absolute values of the parameters is added to the objective function so as to pressure the parameters to be small in absolute value (this has been shown to encourage a reduced, effective number of features).
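A small sketch, assuming NumPy, of the kind of regularized objective function the lasso minimizes for a linear model with weights w: squared error on the training data plus lam times the sum of absolute parameter values. Larger lam pushes more coefficients toward exactly zero, reducing the effective number of features.

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))   # fit term + L1 penalty

# In practice one would minimize this with a ready-made solver, e.g.
# sklearn.linear_model.Lasso(alpha=lam).fit(X, y)
```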
28. Hierarchical Clustering—A clustering algorithm in which the two closest (as defined by some similarity measure) samples (e.g. one sample might be all protein levels measured in serum from one patient) are merged to form a node. Progressively, the next two samples (or nodes, or sample/node) are merged into a node, until all samples are in some node. The sequence in which samples are merged into nodes defines a set of hierarchical clusters. By choosing a minimal level of similarity at which to cut off the merging process, a single clustering of the data can be obtained.
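A minimal sketch, assuming SciPy, on synthetic data: each row of X is one sample (e.g., all protein levels for one patient); the closest samples/nodes are merged repeatedly, and cutting the resulting tree at a chosen distance yields a single clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 5)),      # 10 samples around one centroid
               rng.normal(5, 1, (10, 5))])     # 10 samples around another

Z = linkage(X, method="average")                       # the full sequence of merges
clusters = fcluster(Z, t=4.0, criterion="distance")    # cut the merging process at distance 4
print(clusters)                                        # cluster label for each sample
```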
29. Feature Selection—A step in machine-learning methods in which certain features are chosen to be included in a model, while others are chosen to be omitted.
30. Wrapper—A feature selection technique in which features are selected in conjunction (and interactively) with the model being trained. This allows the feature selection process to be tailored to the model at hand.
31. Filter (in the context of feature selection)—A feature selection technique in which features are selected before training of a model, and independently of the model, for example using a t-test to rank features and then choosing the highest ranking N of them. This is distinct from the meaning of filtering in signal processing where filtering refers to application of a digital filter to the data in order to reduce noise or otherwise change the properties of the signal.
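A minimal sketch of the t-test filter mentioned above, assuming NumPy and SciPy, with classes coded 0/1 and features in the columns of X:

```python
import numpy as np
from scipy.stats import ttest_ind

def filter_top_features(X, y, n_keep=20):
    # Rank every feature by a two-sample t-test between the two classes,
    # independently of whatever model will be trained afterwards.
    _, p_values = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(p_values)[:n_keep]        # column indices of the N selected features
```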
32. Principal Components Analysis (PCA)—A technique commonly used to reduce the dimensionality of data. This is accomplished by finding a new space for the data to lie in, in which the new coordinates are linear combinations of the old ones, and in which the new coordinates (called principal components or PCs) are ordered by the amount of variance for which they account in the original data space. The first PC accounts for the most variation in the original data, and each next PC accounts for progressively less. PCs near the bottom frequently account for almost no variance, and hence can be omitted, producing fewer coordinates, and therefore a lower dimensionality. If one has a point in PCA space, one can “back-project” it to see what point it corresponds to in the original space.
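A minimal sketch, assuming NumPy, of PCA via the singular value decomposition, including the back-projection step:

```python
import numpy as np

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]               # the principal directions
    scores = X_centered @ components.T           # the samples expressed in PCA space
    explained_var = S[:n_components] ** 2 / (len(X) - 1)   # variance accounted for by each PC
    return scores, components, explained_var

# "Back-projection": approximate a point in the original space from PCA space:
# X_approx = scores @ components + X.mean(axis=0)
```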
33. Unfolded PCA—In standard PCA, each sample is a vector of values. If each sample is instead a matrix of values (e.g., the data from one LC-MS run), one can “unfold” the matrix into a vector. This allows standard application of PCA, but throws away some of the information conveyed by storage in a matrix.
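A small sketch of the unfolding step, assuming NumPy and hypothetical matrix-valued samples (e.g., retention time by m/z for each run):

```python
import numpy as np

runs = [np.random.rand(50, 200) for _ in range(10)]      # 10 hypothetical LC-MS runs
X_unfolded = np.array([run.ravel() for run in runs])     # 10 samples x 10,000 features
# X_unfolded can now be passed to an ordinary PCA routine, such as the sketch above,
# at the cost of discarding the two-dimensional neighborhood structure of each run.
```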
34. Bayesian Methods—While many people consider Bayesian methods to be the integration of prior knowledge into the learning task by way of Bayes’ Rule, a more complete description would be centered around the following: Whereas conventional training methods seek to find the single best set of parameters that fit the training data, Bayesian methods seek to represent the uncertainty about the relationships being learned. A practical implication of this, for instance, would be in making predictions from a learned model. In non-Bayesian methods, after learning, one would use a single set of learned parameters to make predictions, whereas in Bayesian prediction, one would integrate out over all possible sets of parameters, weighted by their probabilities (where the probabilities are learned during training) [www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html].
35. Cross-validation—A method for making the most use of a dataset for both learning and validation. Rather than separating the data into a single learning set (called the “training” set) and a single test set, n-fold cross-validation separates the data into n training sets and n test sets. If n were equal to 5, cross-validation would work as follows: The entire dataset would be divided into five equal-sized groups. The first four groups would be used as training data, and the fifth as test data. The second through to fifth groups would then be used as training data and the first group as test data. This procedure is continued until each group has been used as test data. The aggregate test results from all n = 5 phases of the cross-validation would be used to obtain a final estimate of the predictive accuracy. Cross-validation provides an estimate of how a particular model might do on a new, unseen data set drawn from the same statistical distribution.
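A minimal sketch of n-fold cross-validation, assuming NumPy; fit_fn and predict_fn stand in for whatever training and prediction routines are being evaluated (hypothetical callables, not part of the original text):

```python
import numpy as np

def cross_validate(X, y, fit_fn, predict_fn, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, n_folds)               # five roughly equal groups
    accuracies = []
    for i in range(n_folds):
        test_idx = folds[i]                                # each group is the test set once
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit_fn(X[train_idx], y[train_idx])
        accuracies.append(np.mean(predict_fn(model, X[test_idx]) == y[test_idx]))
    return np.mean(accuracies)                             # aggregate estimate of accuracy
```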
36. Bootstrap—A technique to assess the variability of a classifier’s predictive ability (or other statistical quantities) by repeatedly measuring the classifier’s predictive ability, using a different subset of the data (chosen by sampling uniformly from the original data set, with replacement) each time.
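A minimal sketch, assuming NumPy; evaluate stands in for whatever routine measures the quantity of interest on a resampled data set (a hypothetical callable, e.g., a classifier's cross-validated accuracy):

```python
import numpy as np

def bootstrap_estimate(X, y, evaluate, n_rounds=200, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        estimates.append(evaluate(X[idx], y[idx]))
    estimates = np.array(estimates)
    return estimates.mean(), estimates.std()         # central value and its variability
```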
37. Wavelet Decomposition—A transformation used in signal processing that breaks down a signal into both 1) local features (i.e., spanning only a few adjacent m/z values) and 2) frequencies.
38. F Statistic—A test statistic sometimes used to rank features for feature selection. It measures the ratio of between-class variance to within-class variance. Higher F statistics indicate more class-informative features.
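A small sketch, assuming NumPy, of the between-class to within-class ratio for one feature measured in two groups (the degrees-of-freedom scaling used in the formal F statistic is omitted for brevity):

```python
import numpy as np

def f_ratio(x_class0, x_class1):
    grand_mean = np.mean(np.concatenate([x_class0, x_class1]))
    between = (len(x_class0) * (x_class0.mean() - grand_mean) ** 2 +
               len(x_class1) * (x_class1.mean() - grand_mean) ** 2)
    within = np.sum((x_class0 - x_class0.mean()) ** 2) + \
             np.sum((x_class1 - x_class1.mean()) ** 2)
    return between / within          # larger values suggest a more class-informative feature

# scipy.stats.f_oneway(x_class0, x_class1) computes the properly scaled F statistic.
```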
39. Coefficient of Variation (CV)—Defined to be the standard deviation divided by the mean. This is useful if one is comparing how variable two different features are when they are measured on different scales.
40. Mahalanobis Distance—A distance measure (as opposed to, say, a Euclidean distance) in which the variation (e.g. the noise or scale) of each variable is taken into account, as well as the correlation between variables.
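A minimal sketch, assuming NumPy, of the Mahalanobis distance of a sample x from the mean of a data set X (rows are samples): differences are scaled by the inverse covariance matrix, so high-variance or correlated variables do not dominate.

```python
import numpy as np

def mahalanobis(x, X):
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance of the variables
    diff = x - mu
    return np.sqrt(diff @ cov_inv @ diff)
```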
41. Genetic Algorithm—An algorithm in which many feature subsets are chosen at random, and their predictive power is assessed (in the context of some model, for example, a linear discriminant). In a process of “natural selection,” those feature subsets deemed good (or “fit”) are “mutated” and “crossed” (i.e. various operations are performed that change and combine good feature subsets) to produce new feature subsets. This process is continued iteratively until some preset threshold, for example on the predictive power on the training set, is met.
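A compact sketch of such a procedure for feature-subset selection, assuming NumPy; the fitness function (e.g., the cross-validated accuracy of a linear discriminant using only the selected features) is a hypothetical user-supplied callable:

```python
import numpy as np

def genetic_feature_selection(n_features, fitness, pop_size=30, n_generations=50,
                              mutation_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Each individual is a 0/1 mask over the features (1 = feature included).
    population = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(n_generations):
        scores = np.array([fitness(mask) for mask in population])
        keep = population[np.argsort(scores)[::-1][: pop_size // 2]]   # selection: fitter half
        children = []
        while len(keep) + len(children) < pop_size:
            # Crossover: combine two good subsets at a random cut point.
            p1, p2 = keep[rng.integers(len(keep))], keep[rng.integers(len(keep))]
            cut = rng.integers(1, n_features)
            child = np.concatenate([p1[:cut], p2[cut:]])
            # Mutation: randomly flip a few features in or out of the subset.
            flip = rng.random(n_features) < mutation_rate
            child[flip] = 1 - child[flip]
            children.append(child)
        population = np.vstack([keep] + children)
    scores = np.array([fitness(mask) for mask in population])
    return population[scores.argmax()]            # the best feature mask found
```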
42. Fitness Function—The function that measures the “goodness” of particular feature subsets in the context of genetic algorithms.