B-Course tries to find the model with the best possible predictive accuracy, i.e. the model that will best classify future unclassified data vectors. But is it really possible to pick the model that will perform best in the future? After all, you never know about the future.
To understand how the predictive accuracy of a classification model is estimated, it helps to think of the classification model (or classifier for short) as a machine that takes an unclassified data vector as input and gives a guess (prediction) of the class as output. For example, we might insert the data vector ("likes sports", "7 years old", "urban") into the classifier, and out might pop a prediction of gender, "boy" for example. However, before the classifier is functional, it must be trained with classified data vectors. That is all it takes to be a classifier. Actually, we also want our machine to be deterministic, so that every time we train the classifier with the same training data and insert the same data vector, we get the same prediction out.
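The machine metaphor above can be sketched in code. This is only an illustrative toy (a majority-vote "classifier", not B-Course's actual method); the point is the interface: train with classified vectors, then predict a class for an unclassified one, deterministically.

```python
from collections import Counter

class MajorityClassifier:
    """A toy deterministic classifier: it always predicts the most common
    class seen during training (ties broken alphabetically)."""

    def __init__(self):
        self.counts = Counter()

    def train(self, classified_vectors):
        # Training = feeding the machine classified data vectors.
        for _features, klass in classified_vectors:
            self.counts[klass] += 1

    def predict(self, _unclassified_vector):
        # Deterministic: same training data + same input -> same prediction.
        return min(self.counts, key=lambda k: (-self.counts[k], k))

clf = MajorityClassifier()
clf.train([(("likes sports", "7 years old", "urban"), "boy"),
           (("likes reading", "8 years old", "rural"), "girl"),
           (("likes sports", "6 years old", "urban"), "boy")])
print(clf.predict(("likes music", "7 years old", "urban")))  # prints "boy"
```

A real classifier would of course look at the input vector; here it is ignored on purpose to keep the sketch minimal.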
For the gory details you had better follow the link below, but for explaining the estimation of predictive accuracy it is enough to know that the essential part of building classifiers is feeding them classified data (yes, classifiers eat little boys and girls), i.e. training them.
» Calculating predictive distribution
Leave-one-out (LOO) cross-validation is a method for estimating the predictive accuracy of the classifier. In LOO we remove the data vectors one at a time from the data matrix containing N vectors and feed the classifier with the remaining N-1 vectors. This is how the classifier is trained: by feeding it N-1 data vectors. When the classifier has chewed all the N-1 classified data vectors, we take the data vector we just removed from the data matrix and conceal its class, so that to the classifier it looks like an unclassified data vector. We now ask the classifier to classify this data by inserting the unclassified version of the removed data vector. If out pops the correct class (the one we had concealed), the classifier gets one point; if out pops a wrong answer, the classifier is rewarded with no points at all. After this we return the removed data vector to the data matrix and repeat the procedure by removing some other vector. (It is simplest, of course, to first remove the first data vector, then the second one, and so on.) We repeat this "remove - feed in rest - classify removed" game for each data vector in our matrix and sum up the points the classifier gains. After that we divide the points by the number of data vectors to get the average predictive accuracy, i.e. what percentage of the classifications went right. This is our estimate of the classification accuracy of the classifier.
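The "remove - feed in rest - classify removed" game can be sketched as follows. The toy majority-vote trainer is illustrative only (B-Course's real classifiers are more sophisticated); the LOO loop itself is the point.

```python
from collections import Counter

def train_majority(train_data):
    # Toy "classifier": remembers the majority class of its training data
    # (ties broken alphabetically) and ignores the input features.
    counts = Counter(klass for _features, klass in train_data)
    majority = min(counts, key=lambda k: (-counts[k], k))
    return lambda _features: majority

def loo_accuracy(data, train):
    """data: list of (features, class) pairs.
    Remove each vector in turn, train on the remaining N-1 vectors,
    and score one point when the removed vector is classified correctly."""
    points = 0
    for i, (features, true_class) in enumerate(data):
        predict = train(data[:i] + data[i + 1:])  # feed in the rest
        if predict(features) == true_class:       # classify the removed one
            points += 1
    return points / len(data)  # fraction of classifications that went right

data = [((1,), "boy"), ((2,), "boy"), ((3,), "girl"), ((4,), "boy")]
print(loo_accuracy(data, train_majority))  # prints 0.75
```

Note that the single "girl" vector is always misclassified: once it is removed, only "boy" vectors remain in the training data, so the toy classifier can never guess "girl". Three of four rounds score a point, giving 0.75.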
If in each round of LOO we feed the classifier with different N-1 data vectors, don't we also get N differently trained versions of the classifier? We do, so what exactly is the classifier under evaluation here? It is those parts of the machine that stay unchanged no matter what we feed it. In B-Course the classifiers differ from each other by using different predictor variables. The set of predictors in a classifier does not change no matter what data vectors the classifier is trained with, so LOO evaluates this stable part of the classifier.
To build the final version of the classifier, B-Course selects the structure (i.e. the set of predictors) that performed best in LOO. This structure is then trained with the whole data matrix, i.e. with all N classified data vectors.
Actually, the output of a B-Course classifier is a little more sophisticated than what we just described. Instead of just popping out a prediction of the class value, the classifier pops out a distribution over the class values, i.e. it also tells the certainty of the classification. Instead of popping out just "boy", it says "boy" (78%), "girl" (22%). Also, if there are more than two classes, it gives probabilities for them all. The crude classifier described earlier simply finds the most probable class and selects that as its prediction. If there are several equally probable classes, the first one in the list is selected.
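Turning such a predictive distribution into a crude class prediction is a one-liner in spirit; a minimal sketch (the function name is ours, not B-Course's):

```python
def predict_class(distribution):
    """distribution: list of (class_value, probability) pairs, in order.
    Returns the most probable class; on ties, the first one in the list."""
    best_class, best_p = distribution[0]
    for klass, p in distribution[1:]:
        if p > best_p:  # strictly greater, so ties keep the earlier class
            best_class, best_p = klass, p
    return best_class

print(predict_class([("boy", 0.78), ("girl", 0.22)]))  # prints "boy"
```

Using a strict comparison is what implements the tie-breaking rule from the text: a later class only wins if it is strictly more probable.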
With probabilistic classifiers we may also use a more sophisticated criterion for evaluating the performance of the classifier than just counting the relative frequency of correct predictions in LOO. It is natural to reward the classifier more if its correct answers are predicted with high certainty. We would also like the classifier to be penalized more severely if it was certain of its prediction but the prediction turned out to be wrong than if it was very uncertain and then predicted wrong. One simple scoring system is the following: when, for each prediction during LOO, the classifier gives the probabilities of the different classes, the classifier is rewarded with a (fractional) point equal to the probability of the correct (!) class. For example, if the classifier gives the prediction "boy" (78%), "girl" (22%) and "girl" is the correct answer, the classifier gains 0.22 points for this answer. At the end we again take the average of the points gained per prediction. It is customary to take the geometric rather than the arithmetic average, i.e. all the N points are multiplied together and we then take the Nth root of the product. (Taking the logarithm of this number and multiplying the result by minus one is often called the log-score.)
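This probabilistic scoring can be sketched directly from the description above (function names are ours):

```python
import math

def geometric_mean_score(probs_of_correct):
    """probs_of_correct: the probability the classifier assigned to the true
    class, one value per LOO round. Multiply all N points together and take
    the Nth root of the product."""
    product = math.prod(probs_of_correct)
    return product ** (1 / len(probs_of_correct))

def log_score(probs_of_correct):
    # Minus the logarithm of the geometric mean; equivalently, the
    # arithmetic mean of -log(p) over the predictions.
    return -math.log(geometric_mean_score(probs_of_correct))

points = [0.78, 0.22, 0.90]  # e.g. three LOO predictions
print(geometric_mean_score(points))
print(log_score(points))
```

Note how one confident wrong answer (the 0.22) drags the geometric mean down much harder than it would an arithmetic mean, which is exactly the asymmetric penalty the text asks for; in the extreme, a probability of 0 for the correct class sends the log-score to infinity.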
B-Course, version 2.0.0 |