Training the classifier and calculating the predictive distribution is actually rather simple. All that is required is to count certain frequencies (numbers of occurrences) in the data matrix. Below we explain the details.
In addition to the independence statements, the B-Course NB-models can be seen to have two important functions: variable selection and discretization. While B-Course currently uses a fixed discretization (i.e., it does not try to find an optimal discretization), the theory below applies to any discretization. Applying a model M to the original data D can be described as a transformation from D to a new data set D^{M}. D^{M} has the same number of rows as the original data D, but some columns may have been deleted, thus selecting only some of the variables. Furthermore, if some of the selected variables are numerical, transforming D to D^{M} replaces each original number by the index of its interval. For example, if a column in the original data contains the numbers 1.2, 4.0 and 3.2, and the discretization of this column in a model M is [0,1], ]1,2], ]2,3] and ]3,4], the corresponding entries in D^{M} are 2, 4 and 4 (we start indexing from 1). Also, when a variable is not continuous (like the class variable), model M enumerates its values; for example, the values of the class variable are numbered from one on: 1, 2, ...
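The interval-indexing step above can be sketched in Python (a hypothetical helper, not B-Course code; the interval boundaries are taken from the example in the text):

```python
def discretize(value, boundaries):
    """Map a number to the 1-based index of its interval.

    boundaries = [0, 1, 2, 3, 4] encodes the intervals
    [0,1], ]1,2], ]2,3] and ]3,4] from the example above.
    """
    for i in range(1, len(boundaries)):
        if value <= boundaries[i]:
            return i
    raise ValueError("value outside the discretization range")

# The example from the text: 1.2, 4.0 and 3.2 map to interval indices 2, 4 and 4.
indices = [discretize(x, [0, 1, 2, 3, 4]) for x in (1.2, 4.0, 3.2)]  # -> [2, 4, 4]
```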
The class column of D^{M} is identical to that of the original data D, and we denote it by V_{C}. When there are n predictor columns in D^{M}, we name them V_{1}, V_{2}, ..., V_{n}. We will also need to talk about the number of different values in the columns of D^{M}. The number of classes (i.e., the number of different values in column V_{C}) is denoted by |V_{C}|, and the number of values in the i^{th} predictor variable column is denoted by |V_{i}| (e.g., the fourth column has |V_{4}| different values, namely 1, 2, ..., |V_{4}|).
Now some mathematics. Let us denote the classification model by M and our classified training data by D. We denote the unclassified data vector by v = (v_{1},v_{2},...,v_{n}). Our task is to calculate the probability that v belongs to class c (where c is a natural number between 1 and |V_{C}|):
P(V_{C}=c | V_{1}=v_{1}, V_{2}=v_{2}, ..., V_{n}=v_{n}, D, M).
We often use a slightly shorter notation:
P(V_{C}=c | v, D, M).
This probability is easy to calculate using the definition of conditional probability:
P(V_{C}=c | v, D^{M}) = P(V_{C}=c, v | D^{M}) / P(v | D^{M})
We now notice that the denominator P(v | D^{M}) does not depend on c at all, so in order to calculate the ratio of these probabilities for different values of c (for example, how many times more or less probable it is that v belongs to class 1 than to class 2), we only need to calculate the numerators.
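For illustration, here is how that normalization plays out with some made-up numerator values (the numbers are hypothetical, not from any real data set):

```python
# Hypothetical unnormalized numerators P(V_C=c, v | D^M) for classes 1, 2 and 3.
joint = [0.30, 0.10, 0.10]

# Dividing by their sum gives the class probabilities P(V_C=c | v, D^M);
# the ratios between classes are unchanged, so the common denominator
# P(v | D^M) is never needed when we only compare classes.
total = sum(joint)
posterior = [p / total for p in joint]  # approximately [0.6, 0.2, 0.2]
```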
To get the numbers necessary for calculating this probability, we first remove from D^{M} (which contains N data vectors) all but those rows whose data vectors belong to class 1. We call the remaining data matrix, which contains only class 1 data vectors, D_{C=1}. We then count the rows in D_{C=1} and denote that number by N_{1}; i.e., N_{1} out of the N data vectors in our training data belong to class 1. We then investigate the other columns of D_{C=1}. We count the number of rows in D_{C=1} that have value 1 in column V_{1} and call this number N_{1,1,1} (the first subscript tells that we use data matrix D_{C=1}, the second that we look at column V_{1}, and the third that we counted rows having value 1). Similarly, we count the number of rows having value 2 in column V_{1}, which gives the number N_{1,1,2}, and so on. Notice that there may be missing values, so N_{1,1,1} + N_{1,1,2} + ... + N_{1,1,|V1|} does not necessarily sum up to N_{1}, but to something we denote by N_{1,1} (if there is no missing data, then N_{1,1} = N_{1}).
We then do the same counting on column V_{2} of D_{C=1}, obtaining the numbers N_{1,2,1}, N_{1,2,2}, ..., N_{1,2,|V2|} and their sum N_{1,2}. We repeat this for all the predictor columns of D_{C=1}.
When all the predictor columns of D_{C=1} have been examined, we return to our "original" data matrix D^{M}, but this time we remove all but those rows that have value 2 in their class column. The number of rows in the remaining data matrix D_{C=2} is denoted by N_{2}. We then investigate the predictor columns of this data matrix exactly as we did for D_{C=1}. After that we extract the data set D_{C=3} and handle it like we did D_{C=1} and D_{C=2}, and so on for all the classes.
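The counting procedure above can be sketched as follows (a minimal Python sketch, not B-Course code, assuming D^{M} is given as a list of rows whose first entry is the class value and where None marks a missing value):

```python
from collections import defaultdict

def count_frequencies(D_M):
    """Count the frequencies described in the text.

    Returns three dictionaries:
      N_c[c]          = N_c       (rows belonging to class c)
      N_civ[(c,i,u)]  = N_{c,i,u} (class c rows with value u in column V_i)
      N_ci[(c,i)]     = N_{c,i}   (non-missing entries of column V_i in class c)
    """
    N_c = defaultdict(int)
    N_civ = defaultdict(int)
    N_ci = defaultdict(int)
    for row in D_M:
        c = row[0]                 # the class column V_C
        N_c[c] += 1
        for i, u in enumerate(row[1:], start=1):
            if u is not None:      # missing values are simply skipped,
                N_civ[(c, i, u)] += 1    # so N_{c,i} may fall short of N_c
                N_ci[(c, i)] += 1
    return N_c, N_civ, N_ci
```

One pass over the data matrix suffices: this yields exactly the same counts as the class-by-class extraction of D_{C=1}, D_{C=2}, ... described above.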
Let us review the notation: N is the number of data vectors (rows) in D^{M}; N_{c} is the number of rows belonging to class c; N_{c,i,v} is the number of class c rows having value v in column V_{i}; and N_{c,i} = N_{c,i,1} + N_{c,i,2} + ... + N_{c,i,|Vi|} (which equals N_{c} when there is no missing data).
Now we are ready to write the formula for calculating this probability:
P(V_{C}=c, v | D^{M})
  = P(V_{C}=c | D^{M}) * P(v | V_{C}=c, D^{M})
  = P(V_{C}=c | D^{M}) * P(V_{1}=v_{1} | V_{C}=c, D^{M}) * P(V_{2}=v_{2} | V_{C}=c, D^{M}) * ... * P(V_{n}=v_{n} | V_{C}=c, D^{M})
  = (N_{c} + 1) / (N + |V_{C}|) * (N_{c,1,v1} + 1) / (N_{c,1} + |V_{1}|) * (N_{c,2,v2} + 1) / (N_{c,2} + |V_{2}|) * ... * (N_{c,n,vn} + 1) / (N_{c,n} + |V_{n}|)
The last equation, where probabilities are estimated using frequencies calculated from D^{M}, takes slightly more involved Bayesian mathematics, since it is the result of integrating out the multinomial parameters with a uniform prior. However, the resulting probabilities are easy to understand as slightly smoothed relative frequencies. Intuitively, the smoothing is due to the fact that, based on a finite sample, Bayesians are reluctant to give zero probability to any event, even if none of its kind is present in the sample.
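Plugging the counts into the formula can be sketched like this (a hypothetical function, with made-up argument names; n_values[i-1] stands for |V_i| and n_classes for |V_C|):

```python
def joint_probability(c, v, N, class_counts, value_counts, column_counts,
                      n_classes, n_values):
    """Compute P(V_C=c, v | D^M) with the smoothed estimates above.

    class_counts[c]       = N_c
    value_counts[(c,i,u)] = N_{c,i,u}
    column_counts[(c,i)]  = N_{c,i}
    n_values[i-1]         = |V_i|
    """
    # P(V_C=c | D^M) = (N_c + 1) / (N + |V_C|)
    p = (class_counts.get(c, 0) + 1) / (N + n_classes)
    for i, v_i in enumerate(v, start=1):
        # P(V_i=v_i | V_C=c, D^M) = (N_{c,i,v_i} + 1) / (N_{c,i} + |V_i|)
        p *= ((value_counts.get((c, i, v_i), 0) + 1)
              / (column_counts.get((c, i), 0) + n_values[i - 1]))
    return p
```

Note how the smoothing behaves: with no training data at all (all counts zero), every factor becomes uniform, e.g. 1/|V_{C}| for the class term, and no event ever gets probability zero.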
If some (or all) values of the unclassified vector v are missing, we simply modify the formula above by removing all the terms P(V_{i}=v_{i} | V_{C}=c, D^{M}) for which the value of V_{i} is missing. For example, if v only has values for variables V_{1}, V_{3} and V_{7}, we calculate
P(V_{C}=c, v | D^{M})
  = P(V_{C}=c | D^{M}) * P(v | V_{C}=c, D^{M})
  = P(V_{C}=c | D^{M}) * P(V_{1}=v_{1} | V_{C}=c, D^{M}) * P(V_{3}=v_{3} | V_{C}=c, D^{M}) * P(V_{7}=v_{7} | V_{C}=c, D^{M})
  = (N_{c} + 1) / (N + |V_{C}|) * (N_{c,1,v1} + 1) / (N_{c,1} + |V_{1}|) * (N_{c,3,v3} + 1) / (N_{c,3} + |V_{3}|) * (N_{c,7,v7} + 1) / (N_{c,7} + |V_{7}|)
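Dropping the terms for missing values can be sketched as follows (a hypothetical helper with made-up factor values; None marks a missing entry of v):

```python
def joint_with_missing(class_term, factors, v):
    """Multiply the class term by the factors of the observed variables only.

    class_term is P(V_C=c | D^M); factors[i] is the (hypothetical, already
    smoothed) term for the (i+1)-th predictor. Terms whose entry in v is
    None (missing) are simply left out of the product, as described above.
    """
    p = class_term
    for v_i, f in zip(v, factors):
        if v_i is not None:
            p *= f
    return p

# With v = (1, None, 2), only the first and third factors contribute:
# 0.5 * 0.5 * 0.4 = 0.1.
p = joint_with_missing(0.5, [0.5, 0.25, 0.4], (1, None, 2))
```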
B-Course, version 2.0.0