
Calculating the predictive distribution

Training the classifier and calculating the predictive distribution is actually rather simple. All that is required is to count certain frequencies (numbers of occurrences) in the data matrix. Below we explain the details.

Model as a variable selector and discretizer

In addition to the independence statements, the B-Course NB-models can be seen to have two important functions: variable selection and discretization. While B-Course currently uses a fixed discretization (i.e., it does not try to find an optimal discretization), the theory below applies to any discretization. Applying model M to the original data D can be described as a transformation from D to a new data set DM. DM has the same number of rows as the original data D, but some columns may have been deleted, thus selecting only some of the variables. Furthermore, if some of the selected variables are numerical, transforming D to DM replaces each original number by the index of the interval it falls into. For example, if a column in the original data contains the numbers 1.2, 4.0 and 3.2, and the discretization of this column in a model M is [0,1], ]1,2], ]2,3] and ]3,4], the corresponding entries in DM are 2, 4 and 4 (we start indexing from 1). Also, when a variable is not continuous (like the class variable), model M enumerates its values; for example, the values of the class variable are numbered from one onwards: 1, 2, ...
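To make the transformation concrete, the following minimal sketch (our own Python illustration, not B-Course's code) maps one numeric column to 1-based interval indices using the cut points of the example above:

    import bisect

    def discretize(value, upper_bounds):
        # Map a number to the 1-based index of the interval it falls into.
        # upper_bounds lists the right endpoints, e.g. [1, 2, 3, 4] for the
        # intervals [0,1], ]1,2], ]2,3] and ]3,4].
        return bisect.bisect_left(upper_bounds, value) + 1

    column = [1.2, 4.0, 3.2]                              # a numeric column of D
    print([discretize(x, [1, 2, 3, 4]) for x in column])  # -> [2, 4, 4], the DM column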

The class column of DM is identical to that of the original data D and we denote it by VC. When there are n predictor columns in DM, we name them V1, V2, ..., Vn. We will also need to talk about the number of different values in the columns of DM. The number of classes (i.e., the number of different values in column VC) is denoted by |VC|, and the number of values in the ith predictor variable column is denoted by |Vi| (e.g., the fourth column has |V4| different values, namely 1, 2, ..., |V4|).

Some math

Now some mathematics. Let us denote the classification model by M and our classified training data by D. We denote the unclassified data vector by v = (v1,v2,...,vn). Our task is to calculate the probability that v belongs to class c (where c is a natural number between 1 and |VC|):

P(VC=c | V1=v1, V2=v2, ..., Vn=vn, D, M).

We often use a little bit shorter notation:

P(VC=c | v, D, M).

This probability is easy to calculate using the definition of conditional probability:

P(VC=c | v, DM) = P(VC=c, v | DM) / P(v | DM)

We now notice that the denominator P(v | DM) does not depend on c at all, so in order to calculate the ratio of these probabilities for different values of c (for example, how many times more or less probable it is that v belongs to class 1 than to class 2), we only need to calculate the numerators.
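As a small illustration with made-up numbers, once the numerators P(VC=c, v | DM) have been computed for every class, the class probabilities follow by normalizing them to sum to one:

    # Unnormalized numerators P(VC=c, v | DM) for classes 1..3 (illustrative values).
    joint = {1: 0.012, 2: 0.003, 3: 0.005}
    total = sum(joint.values())                           # plays the role of P(v | DM)
    posterior = {c: p / total for c, p in joint.items()}
    print(posterior)                                      # class 1 is 4 times as probable as class 2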

Notations for calculating P(VC=c, v | DM)

To get the necessary numbers for calculating this probability, we first remove from DM (which contains N data vectors) all but those rows whose data vectors belong to class 1. We call the remaining data matrix, which contains only class 1 data vectors, DC=1. We then count the rows in DC=1 and denote that number by N1, i.e., N1 out of the N data vectors in our training data belong to class 1. We then investigate the other columns of DC=1. We count the number of rows in DC=1 that have value 1 in column V1 and call this number N1,1,1 (the first subscript tells that we use data matrix DC=1, the second subscript tells that we look at column V1, and the third subscript tells that we counted rows having value 1). Similarly, we count the number of rows having value 2 in column V1, which gives the number N1,1,2, and so on. Notice that there may be missing values, so N1,1,1 + N1,1,2 + ... + N1,1,|V1| does not necessarily sum up to N1, but to something we denote by N1,1 (if there is no missing data, then N1,1 = N1).

We then do the same counting for column V2 of DC=1 and get the numbers N1,2,1, N1,2,2, ..., N1,2,|V2|, and their sum N1,2. We repeat this for all the predictor columns of DC=1.

When all the predictor columns of DC=1 have been examined, we return to our "original" data matrix DM, but this time we remove all but those rows that have value 2 in their class column. The number of rows in the remaining data matrix DC=2 is denoted by N2. We then investigate the predictor columns of this data matrix exactly as we did for DC=1. After that we extract the data set DC=3 and handle it like we did DC=1 and DC=2, and continue in the same way until all |VC| classes have been processed.

Let us now summarize the notation: N is the total number of data vectors in DM, Nc is the number of those belonging to class c, Nc,i,v is the number of class c data vectors having value v in column Vi, and Nc,i is the number of class c data vectors with a non-missing value in column Vi.
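Under an assumed data layout (each row of DM is a list of the n predictor values followed by the class value, with None marking a missing entry), all of these counts could be gathered in one pass over DM, roughly as in the following sketch (our own illustration, not B-Course's actual code):

    from collections import defaultdict

    def count_frequencies(DM, n):
        N_c = defaultdict(int)    # Nc     : number of rows in class c
        N_ci = defaultdict(int)   # Nc,i   : class c rows with a non-missing value in column Vi
        N_civ = defaultdict(int)  # Nc,i,v : class c rows with value v in column Vi
        for row in DM:
            c = row[-1]                       # class column VC
            N_c[c] += 1
            for i in range(n):                # predictor columns V1..Vn
                v = row[i]
                if v is not None:             # skip missing values
                    N_ci[(c, i + 1)] += 1
                    N_civ[(c, i + 1, v)] += 1
        return N_c, N_ci, N_civ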

Calculating P(VC=c, v | DM)

And now we are ready to write the formula for calculating this probability:

P(VC=c, v | DM) = P(VC=c | DM) * P(v | VC=c, DM)
= P(VC=c | DM) * P(V1=v1 | VC=c, DM) * P(V2=v2 | VC=c, DM) * ... * P(Vn=vn | VC=c, DM)
= [(Nc + 1) / (N + |VC|)] * [(Nc,1,v1 + 1) / (Nc,1 + |V1|)] * [(Nc,2,v2 + 1) / (Nc,2 + |V2|)] * ... * [(Nc,n,vn + 1) / (Nc,n + |Vn|)]

The last equation, where the probabilities are estimated using frequencies calculated from DM, requires a little more involved Bayesian math, since it is the result of integrating out the multinomial parameters under a uniform prior. However, the resulting probabilities are simple to understand as slightly smoothed relative frequencies. Intuitively, the smoothing is due to the fact that, based on a finite sample, Bayesians are reluctant to give zero probability to any event, even if it was not present in the sample.
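Putting the pieces together, the whole predictive calculation could look roughly like the sketch below, which reuses the counts from the previous sketch; value_counts[i] stands for |Vi| and n_classes for |VC| (the function and parameter names are ours, not B-Course's):

    def predict(v, N_c, N_ci, N_civ, n_classes, value_counts):
        N = sum(N_c.values())
        joint = {}
        for c in range(1, n_classes + 1):
            p = (N_c[c] + 1) / (N + n_classes)            # P(VC=c | DM)
            for i, vi in enumerate(v, start=1):
                if vi is None:                            # missing query value: drop the term
                    continue
                p *= (N_civ[(c, i, vi)] + 1) / (N_ci[(c, i)] + value_counts[i])
            joint[c] = p                                  # P(VC=c, v | DM)
        total = sum(joint.values())
        return {c: p / total for c, p in joint.items()}   # P(VC=c | v, DM)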

What if our query has missing data?

If some (or even all) of the values of the unclassified vector v are missing, we simply modify the formula above by removing all the terms P(Vi=vi | VC=c, DM) for which the value of Vi is missing. For example, if v only has values for variables V1, V3 and V7, we calculate

P(VC=c, v | DM) = P(VC=c | DM) * P(v | VC=c, DM)
= P(VC=c | DM) * P(V1=v1 | VC=c, DM) * P(V3=v3 | VC=c, DM) * P(V7=v7 | VC=c, DM)
= [(Nc + 1) / (N + |VC|)] * [(Nc,1,v1 + 1) / (Nc,1 + |V1|)] * [(Nc,3,v3 + 1) / (Nc,3 + |V3|)] * [(Nc,7,v7 + 1) / (Nc,7 + |V7|)]
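In the sketch given earlier, this case needs no separate formula: marking the missing query values with None drops exactly the corresponding terms from the product. For example (with illustrative |Vi| sizes):

    # Query vector with values only for V1, V3 and V7; other entries are missing.
    v = [1, None, 2, None, None, None, 3]
    probs = predict(v, N_c, N_ci, N_civ, n_classes=3,
                    value_counts={1: 2, 3: 4, 7: 5})
    print(probs)      # P(VC=c | v, DM) for c = 1, 2, 3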

 
