Discretization

B-Course builds dependency models for categorical data. If the variables are not categorical, they are made categorical by discretizing (categorizing) them into intervals.

B-Course will transform your numerical variables

The set of Naive Bayes models we consider expects the variables to be categorical (like gender, favorite color, etc.). However, variables are often naturally numerical (like age), or their values at least have some natural order (like "strongly agree", "agree", "indifferent", "disagree" and "strongly disagree"). In these cases we categorize the variables, which destroys all the information about numerical value and order. Continuous numerical variables are discretized into intervals, and even the order of the intervals is forgotten. For ordered variables, too, the information about the order is ignored.
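As a rough illustration of the paragraph above, the sketch below turns a numerical variable into unordered interval labels. It assumes equal-width intervals, which is only one possible scheme and is not necessarily the one B-Course uses; the function name equal_width_bins and the label format are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to an interval label (assumed equal-width scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width > 0:
            # Index of the interval containing v; the maximum falls into the last one.
            idx = min(int((v - lo) / width), n_bins - 1)
        else:
            # All values are identical, so there is only one interval.
            idx = 0
        # The label is just a name: a categorical model does not know
        # that "bin0" lies below "bin1".
        labels.append(f"bin{idx}")
    return labels


ages = [18, 22, 35, 41, 58, 63]
age_labels = equal_width_bins(ages, 3)
print(age_labels)
```

After this step the model only sees the labels, so both the exact ages and the ordering of the intervals are gone.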

Why this destruction of information?

The main reason to do this is that for categorical variables we do not have to make very strong distributional assumptions such as normality (Gaussian distribution) or even unimodality (meaning that the distribution curve has just one bump in it). Distributions of categorical variables are also easier to understand, since it is usually sufficient to just count occurrences of values in a data matrix rather than computing sums and sums of squares (which of course is not that terribly difficult) and dealing with exponential functions and such. With categorical variables we can also talk about probabilities instead of densities and have sums instead of integrals. All in all, categorical variables are much easier creatures to handle with high school math.
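The contrast between counting and computing sums can be sketched as follows. This is a plain relative-frequency illustration, not the Bayesian computation B-Course performs; the function names categorical_probs and gaussian_params are hypothetical.

```python
from collections import Counter


def categorical_probs(labels):
    """Estimate probabilities of categorical values by counting occurrences."""
    counts = Counter(labels)
    n = len(labels)
    return {value: count / n for value, count in counts.items()}


def gaussian_params(values):
    """Estimate mean and variance from the sum and the sum of squares."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    mean = total / n
    variance = total_sq / n - mean ** 2  # maximum likelihood (divide by n) form
    return mean, variance


probs = categorical_probs(["red", "blue", "red", "green"])
mean, variance = gaussian_params([1.0, 2.0, 3.0, 4.0])
print(probs, mean, variance)
```

The categorical estimate needs nothing beyond a table of counts, while the Gaussian estimate already commits to a bell-shaped curve described by two numbers.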

What is lost due to this destruction of information?

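Sometimes some statistical power. If the underlying distribution happens to be Gaussian, we may end up using more parameters than necessary: a Gaussian is described by just two parameters (mean and variance), whereas a variable discretized into k intervals needs k - 1 free probabilities. However, this is not usually a big problem.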

Does this lead to a situation where the analysis doesn't have enough data (and thus is not performed)?

No. Bayesian approaches never declare that they do not have enough data. The Bayesian analysis takes into account all the data available; there are no preset sample sizes that have to be satisfied in order to perform the classification analysis. In effect, if your sample is too small, you should end up with a model without any arcs. Well, that is a Naive Bayes model too.

Data formatting manually

If you have continuous variables, you might want to discretize your data yourself beforehand to get a meaningful discretization. Often a certain discretization is meaningful because of theoretical assumptions or previous studies of the subject. These things cannot be inferred automatically.
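A minimal sketch of such manual discretization, using hand-picked division points, might look like this. The cut points (18 and 65) and the group names are purely hypothetical examples of domain knowledge; substitute whatever your own theory or earlier studies suggest.

```python
from bisect import bisect_right

# Hypothetical division points and labels chosen by the analyst.
AGE_CUTS = [18, 65]
AGE_LABELS = ["minor", "adult", "senior"]


def discretize(value, cut_points, labels):
    """Return the label of the interval that contains value.

    cut_points must be sorted, and labels needs one more entry than cut_points.
    A value equal to a cut point goes to the interval above it.
    """
    return labels[bisect_right(cut_points, value)]


ages = [12, 18, 40, 65, 80]
groups = [discretize(a, AGE_CUTS, AGE_LABELS) for a in ages]
print(groups)
```

The resulting labels can be written into the data file in place of the original numbers before the data is uploaded.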

Does the discretization have an effect on the results?

It does. If you discretize your continuous variables into just a few intervals, you are more likely to get that variable connected to the class variable than if you discretize them into very many intervals. The result may also change if you change the division points that define your discretization.
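One way to get a feel for the effect of the number of intervals is to look at how many data vectors land in each (interval, class) cell. The sketch below assumes equal-width intervals and toy data; the helper names are hypothetical, and the thinning of the counts is offered only as an intuition for the behaviour described above, not as B-Course's actual model selection criterion.

```python
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width interval containing value (assumes hi > lo)."""
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)


def cell_counts(values, classes, n_bins):
    """Count the data vectors in every (interval, class) combination."""
    lo, hi = min(values), max(values)
    return Counter(
        (bin_index(v, lo, hi, n_bins), c) for v, c in zip(values, classes)
    )


values = [1, 2, 3, 4, 5, 6, 7, 8]
classes = ["a", "a", "a", "a", "b", "b", "b", "b"]

coarse = cell_counts(values, classes, 2)  # two intervals: few, well-filled cells
fine = cell_counts(values, classes, 7)    # seven intervals: many thin cells
print(coarse)
print(fine)
```

With two intervals each occupied cell holds four observations; with seven intervals almost every cell holds a single one, so there is far less evidence behind each estimated probability.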

So how can I find out if the discretization is good?

A good discretization for classification purposes is one that leads to models with good predictive performance, i.e. models that will classify future unclassified data vectors well. At the moment B-Course does not try to solve this discretization problem but uses a fixed discretization for continuous variables.
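If you discretize the data yourself, one hedged way to compare candidate discretizations is to hold out part of the data and check how well a simple classifier predicts it. The sketch below is a single-feature, count-based classifier with add-one smoothing, written only for illustration; it is not B-Course's procedure, and the data, cut points and function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def train(bins, classes, n_bins, alpha=1.0):
    """Return a predictor built from smoothed counts of (class, interval) pairs."""
    class_counts = Counter(classes)
    joint = Counter(zip(classes, bins))
    total = len(classes)

    def predict(b):
        def score(c):
            prior = class_counts[c] / total
            likelihood = (joint[(c, b)] + alpha) / (class_counts[c] + alpha * n_bins)
            return prior * likelihood

        # Sorting makes tie-breaking deterministic.
        return max(sorted(class_counts), key=score)

    return predict


def holdout_accuracy(cut_points, train_data, test_data):
    """Fraction of held-out vectors classified correctly for given cut points."""
    n_bins = len(cut_points) + 1

    def to_bin(value):
        return bisect_right(cut_points, value)

    predict = train(
        [to_bin(v) for v, _ in train_data],
        [c for _, c in train_data],
        n_bins,
    )
    hits = sum(predict(to_bin(v)) == c for v, c in test_data)
    return hits / len(test_data)


train_data = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
test_data = [(2.5, "low"), (4, "low"), (6, "high"), (8.5, "high")]

acc_middle = holdout_accuracy([5], train_data, test_data)    # cut between the groups
acc_skewed = holdout_accuracy([8.5], train_data, test_data)  # cut far to one side
print(acc_middle, acc_skewed)
```

On this toy data the cut point at 5 classifies all held-out vectors correctly, while the cut point at 8.5 misclassifies one of the four. With real data you would repeat the comparison over several splits before trusting the difference.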



B-Course, version 2.0.0	CoSCo 2002