
Calculating the predictive distribution

Training the classifier and calculating the predictive distribution is actually rather simple. All that is required is to count certain frequencies (numbers of occurrences) in the data matrix. Below we explain the details.

Model as a variable selector and discretizer

In addition to the independence statements, the B-Course NB-models can be seen to have two important functions: variable selection and discretization. While B-Course currently uses a fixed discretization (i.e., it does not try to find an optimal discretization), the theory below applies to any discretization. Applying model M to the original data D can be described as a transformation from D to a new data set DM. DM has the same number of rows as the original data D, but some columns may have been deleted, thus selecting only some of the variables. Furthermore, if some of the selected variables are numerical, transforming D to DM replaces each original number by the index of the interval it falls into. For example, if a column in the original data contains the numbers 1.2, 4.0 and 3.2, and the discretization of this column in a model M is [0,1], ]1,2], ]2,3] and ]3,4], the corresponding entries in DM are 2, 4 and 4 (we start indexing from 1). Also, when a variable is not continuous (like the class variable), model M enumerates its values; for example, the values of the class variable are numbered from one onwards: 1, 2, ...
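To make the transformation concrete, the following minimal sketch (our own Python illustration, not B-Course's code) maps one numeric column to 1-based interval indices using the cut points of the example above:

    import bisect

    def discretize(value, upper_bounds):
        # Map a number to the 1-based index of the interval it falls into.
        # upper_bounds lists the right endpoints, e.g. [1, 2, 3, 4] for the
        # intervals [0,1], ]1,2], ]2,3] and ]3,4].
        return bisect.bisect_left(upper_bounds, value) + 1

    column = [1.2, 4.0, 3.2]                              # a numeric column of D
    print([discretize(x, [1, 2, 3, 4]) for x in column])  # -> [2, 4, 4], the DM column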

The class column of DM is identical to that of the original data D and we denote it by VC. When there are n predictor columns in DM, we name them V1, V2, ..., Vn. We will also need to talk about the number of different values in the columns of DM. The number of classes (i.e., the number of different values in column VC) is denoted by |VC|, and the number of values in the ith predictor variable column is denoted by |Vi| (e.g., the fourth column has |V4| different values, namely 1, 2, ..., |V4|).

Some math

Now some mathematics. Let us denote the classification model by M and our classified training data by D. We denote the unclassified data vector by v = (v1,v2,...,vn). Our task is to calculate the probability that v belongs to class c (where c is a natural number between 1 and |VC|):

P(VC=c | V1=v1, V2=v2, ..., Vn=vn, D, M).

We often use a little bit shorter notation:

P(VC=c | v, D, M).

This probability is easy to calculate using the definition of conditional probability:

P(VC=c | v, DM) = P(VC=c, v | DM) / P(v | DM)

We now notice that the denominator P(v | DM) does not depend on c at all, so in order to calculate the ratio of these probabilities for different values of c (for example, how many times more or less probable it is that v belongs to class 1 than to class 2), we only need to calculate the numerators.
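As a small illustration with made-up numbers, once the numerators P(VC=c, v | DM) have been computed for every class, the class probabilities follow by normalizing them to sum to one:

    # Unnormalized numerators P(VC=c, v | DM) for classes 1..3 (illustrative values).
    joint = {1: 0.012, 2: 0.003, 3: 0.005}
    total = sum(joint.values())                           # plays the role of P(v | DM)
    posterior = {c: p / total for c, p in joint.items()}
    print(posterior)                                      # class 1 is 4 times as probable as class 2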

Notations for calculating P(VC=c, v | DM)

To get the necessary numbers for calculating this probability, we first remove from DM (which contains N data vectors) all but those rows whose data vectors belong to class 1. We call the remaining data matrix, which contains only class 1 data vectors, DC=1. We then count the rows in DC=1 and denote that number by N1, i.e., N1 out of the N data vectors in our training data belong to class 1. We then investigate the other columns of DC=1. We count the number of rows in DC=1 that have value 1 in column V1 and call this number N1,1,1 (the first subscript tells that we use data matrix DC=1, the second subscript tells that we look at column V1, and the third subscript tells that we counted rows having value 1). Similarly, we count the number of rows having value 2 in column V1, which gives the number N1,1,2, and so on. Notice that there may be missing values, so N1,1,1 + N1,1,2 + ... + N1,1,|V1| does not necessarily sum up to N1, but to something we denote by N1,1 (if there is no missing data, then N1,1 = N1).

We then do the same counting for column V2 of DC=1 and get the numbers N1,2,1, N1,2,2, ..., N1,2,|V2|, and their sum N1,2. We repeat this for all the predictor columns of DC=1.

When all the predictor columns of DC=1 have been examined, we return to our "original" data matrix DM, but this time we remove all but those rows that have value 2 in their class column. The number of rows in the remaining data matrix DC=2 is denoted by N2. We then investigate the predictor columns of this data matrix exactly as we did for DC=1. After that we extract the data set DC=3 and handle it like we did DC=1 and DC=2, and continue in the same way until all |VC| classes have been processed.

Let us now summarize the notation: N is the total number of data vectors in DM, Nc is the number of those belonging to class c, Nc,i,v is the number of class c data vectors having value v in column Vi, and Nc,i is the number of class c data vectors with a non-missing value in column Vi.
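Under an assumed data layout (each row of DM is a list of the n predictor values followed by the class value, with None marking a missing entry), all of these counts could be gathered in one pass over DM, roughly as in the following sketch (our own illustration, not B-Course's actual code):

    from collections import defaultdict

    def count_frequencies(DM, n):
        N_c = defaultdict(int)    # Nc     : number of rows in class c
        N_ci = defaultdict(int)   # Nc,i   : class c rows with a non-missing value in column Vi
        N_civ = defaultdict(int)  # Nc,i,v : class c rows with value v in column Vi
        for row in DM:
            c = row[-1]                       # class column VC
            N_c[c] += 1
            for i in range(n):                # predictor columns V1..Vn
                v = row[i]
                if v is not None:             # skip missing values
                    N_ci[(c, i + 1)] += 1
                    N_civ[(c, i + 1, v)] += 1
        return N_c, N_ci, N_civ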

Calculating P(VC=c, v | DM)

And now we are ready to write the formula for calculating this probability:

P(VC=c, v | DM) = P(VC=c | DM) * P(v | VC=c, DM)
= P(VC=c | DM) * P(V1=v1 | VC=c, DM) * P(V2=v2 | VC=c, DM) * ... * P(Vn=vn | VC=c, DM)
= [(Nc + 1) / (N + |VC|)] * [(Nc,1,v1 + 1) / (Nc,1 + |V1|)] * [(Nc,2,v2 + 1) / (Nc,2 + |V2|)] * ... * [(Nc,n,vn + 1) / (Nc,n + |Vn|)]

The last equation, where the probabilities are estimated using frequencies calculated from DM, requires a little more involved Bayesian math, since it is the result of integrating out the multinomial parameters under a uniform prior. However, the resulting probabilities are simple to understand as slightly smoothed relative frequencies. Intuitively, the smoothing is due to the fact that, based on a finite sample, Bayesians are reluctant to give zero probability to any event, even if it was not present in the sample.
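Putting the pieces together, the whole predictive calculation could look roughly like the sketch below, which reuses the counts from the previous sketch; value_counts[i] stands for |Vi| and n_classes for |VC| (the function and parameter names are ours, not B-Course's):

    def predict(v, N_c, N_ci, N_civ, n_classes, value_counts):
        N = sum(N_c.values())
        joint = {}
        for c in range(1, n_classes + 1):
            p = (N_c[c] + 1) / (N + n_classes)            # P(VC=c | DM)
            for i, vi in enumerate(v, start=1):
                if vi is None:                            # missing query value: drop the term
                    continue
                p *= (N_civ[(c, i, vi)] + 1) / (N_ci[(c, i)] + value_counts[i])
            joint[c] = p                                  # P(VC=c, v | DM)
        total = sum(joint.values())
        return {c: p / total for c, p in joint.items()}   # P(VC=c | v, DM)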

What if our query has missing data?

If some (or even all) of the values of the unclassified vector v are missing, we simply modify the formula above by removing all the terms P(Vi=vi | VC=c, DM) for which the value of Vi is missing. For example, if v only has values for variables V1, V3 and V7, we calculate

P(VC=c, v | DM) = P(VC=c | DM) * P(v | VC=c, DM)
= P(VC=c | DM) * P(V1=v1 | VC=c, DM) * P(V3=v3 | VC=c, DM) * P(V7=v7 | VC=c, DM)
= [(Nc + 1) / (N + |VC|)] * [(Nc,1,v1 + 1) / (Nc,1 + |V1|)] * [(Nc,3,v3 + 1) / (Nc,3 + |V3|)] * [(Nc,7,v7 + 1) / (Nc,7 + |V7|)]
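In the sketch given earlier, this case needs no separate formula: marking the missing query values with None drops exactly the corresponding terms from the product. For example (with illustrative |Vi| sizes):

    # Query vector with values only for V1, V3 and V7; other entries are missing.
    v = [1, None, 2, None, None, None, 3]
    probs = predict(v, N_c, N_ci, N_civ, n_classes=3,
                    value_counts={1: 2, 3: 4, 7: 5})
    print(probs)      # P(VC=c | v, DM) for c = 1, 2, 3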

 
