Classification modeling

Classification modeling uses classified data to build a model that can be used to infer the class of unclassified data.

What is classification modeling?

In classification modeling, one column of your data matrix is selected as the class variable, and the rest are called predictor variables. A natural class variable is categorical, but technically it can be any variable with a small number of values. A typical class variable is Gender (male, female), but there is nothing wrong with using, for example, a discretized age ("younger than 21", "21 to 65", "over 65") as the class variable.

Classification modeling means finding a model that, given the values of the predictor variables, infers the value of the class variable. For example, given the values of the variables "Likes Sports (yes, no)" and "Age", the model could predict that there is a 75% probability that the person's Gender is male.

Classification modeling also gives a kind of test of whether some classes are similar or not. If a model can correctly tell the classes apart, there must be some difference between them. By further analyzing the importance of the different variables in the classification, we can also find the major differences between the classes and how "significant" those differences are. Clearly, if we can throw away a variable without harming the model's classification capabilities, that variable is not one that makes a crucial difference (or, alternatively, our model might not have gotten it right).
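The idea of inferring a class probability from predictor values can be illustrated with a small sketch. This is not B-Course's actual method; it is a simple naive Bayes classifier with add-one smoothing, chosen only because it is short. The data rows, the discretized age groups, and the names `DATA`, `train`, and `predict_proba` are all made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up toy data: (likes_sports, age_group, gender).
# The last column, gender, plays the role of the class variable.
DATA = [
    ("yes", "younger than 21", "male"),
    ("yes", "21 to 65", "male"),
    ("yes", "21 to 65", "male"),
    ("no", "over 65", "male"),
    ("no", "younger than 21", "female"),
    ("no", "21 to 65", "female"),
    ("yes", "over 65", "female"),
    ("no", "21 to 65", "female"),
]


def train(rows):
    """Count how often each predictor value occurs within each class."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (predictor index, class) -> value counts
    values = defaultdict(set)            # predictor index -> set of seen values
    for *predictors, cls in rows:
        for i, value in enumerate(predictors):
            value_counts[(i, cls)][value] += 1
            values[i].add(value)
    return class_counts, value_counts, values


def predict_proba(model, predictors):
    """Return a probability for each class, given the predictor values."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for i, value in enumerate(predictors):
            # Add-one smoothing avoids zero probabilities for unseen values.
            score *= (value_counts[(i, cls)][value] + 1) / (n + len(values[i]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


model = train(DATA)
print(predict_proba(model, ("yes", "21 to 65")))
```

The output is a probability for each value of the class variable, which is the kind of statement described above ("there is a 75% probability that the Gender is male"), although the exact numbers here depend entirely on the toy data.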

What is Bayesian in this classification modeling?

In B-Course classification, Bayesian theory gives us the tools to merge many quantitative (parametric) models into a single classification model. Being Bayesian makes it possible to talk about the probabilities of parameters, so we can take the results of many parametric models and merge them by weighting the result of each parametrization by its probability. This so-called model averaging yields a good predictive model. A "classical" (frequentist) statistician is not allowed to speak about probabilities of parameters. Why? It is a philosophical issue, related to the question of how probability theory should be used to answer our questions, and it is open to debate. If you are interested, you can find more information in the texts in the B-Course library.
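Weighting parametrizations by their probability can be sketched on the simplest possible case: predicting a two-valued class variable from counts alone. The example below is an illustration under stated assumptions, not B-Course's implementation. It assumes a single parameter theta (the probability of one class), a uniform prior, and a finite grid of candidate theta values; the function name `averaged_prediction` is hypothetical.

```python
def averaged_prediction(k, n, grid_size=101):
    """Predict the probability of a class seen k times in n observations
    by averaging over many parametrizations instead of picking one."""
    # Candidate parametrizations: theta = 0.00, 0.01, ..., 1.00
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Likelihood of the observed data under each theta. With a uniform
    # prior (an assumption), the posterior is proportional to this.
    weights = [t**k * (1 - t) ** (n - k) for t in thetas]
    total = sum(weights)
    posterior = [w / total for w in weights]
    # Each parametrization predicts theta; weight it by its probability.
    return sum(p * t for p, t in zip(posterior, thetas))


k, n = 3, 4                       # e.g. 3 of 4 observed persons are male
single_best = k / n               # prediction from the single most likely theta
averaged = averaged_prediction(k, n)
print(single_best, averaged)
```

The averaged prediction is pulled toward 0.5 compared with the single most likely parametrization, because with so little data many other values of theta remain plausible and each contributes in proportion to its probability.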


  B-Course, version 2.0.0
CoSCo 2002