Discretization

B-Course builds dependency models for categorical data. If the variables are not categorical, they are forced to be categorical by discretizing (categorizing) them into intervals.

B-Course will transform your numerical variables

The set of dependency models we consider expects the variables to be categorical (like gender, favorite color, etc.). However, many variables are naturally numerical (like age), or their values at least have some natural order (like "strongly agree", "agree", "indifferent", "disagree" and "strongly disagree"). In these cases we categorize the variables, which destroys all the information about numerical value and order. Continuous numerical variables are discretized into intervals, and even the order of the intervals is forgotten. In ordered variables, the information about the order is likewise ignored.
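
As a rough illustration of what this categorization does, here is a minimal Python sketch (the variable, the cut points, and the labels are all invented for illustration; B-Course chooses its intervals automatically):

  import numpy as np

  # Hypothetical numeric variable: ages of eight respondents.
  ages = np.array([23, 35, 61, 47, 19, 52, 70, 33])

  # Invented cut points at 30 and 50.
  cut_points = [30, 50]

  # np.digitize assigns each value to one of the three intervals.
  interval_index = np.digitize(ages, cut_points)

  # Replace interval indices with unordered labels. From this point on
  # the model sees only distinct categories; the numeric values and
  # their order are no longer available to it.
  labels = np.array(["young", "middle", "old"])[interval_index]
  print(labels)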

Why this destruction of information?

The main reason to do this is that for categorical variables we can build models that capture "non-linear" relationships between variables. We also get rid of suspect distributional assumptions (like multivariate normality). The advantage is fewer assumptions and the possibility of finding more complex relationships.
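
B-Course itself uses a Bayesian model rather than the chi-square test below; the test is used here only to make the general point concrete. The following sketch simulates a non-linear dependency that a linear measure (correlation) misses almost completely, but that a categorical analysis of the discretized variables detects:

  import numpy as np
  from scipy.stats import chi2_contingency, pearsonr

  rng = np.random.default_rng(0)
  x = rng.uniform(-1, 1, 2000)
  y = x ** 2 + rng.normal(0, 0.05, 2000)   # y depends on x, but non-linearly

  # A linear measure sees almost nothing: the correlation is near zero.
  print(pearsonr(x, y)[0])

  # Discretize both variables into three equal-frequency categories
  # and cross-tabulate them.
  xc = np.digitize(x, np.quantile(x, [1/3, 2/3]))
  yc = np.digitize(y, np.quantile(y, [1/3, 2/3]))
  table = np.zeros((3, 3))
  for i, j in zip(xc, yc):
      table[i, j] += 1

  # The categorical analysis detects the dependency clearly
  # (a very small p-value).
  print(chi2_contingency(table)[1])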

What is lost due to this destruction of information?

Mainly statistical power. If the relationship between variables happens to be linear, linear models will detect that relationship with less data than B-Course (which is not surprising, since linear models are naturally good at detecting linear dependencies; otherwise they would be totally useless).

Does this lead to a situation where the analysis doesn't have enough data (and thus is not performed)?

No. Bayesian approaches never declare that they have too little data. A Bayesian analysis takes into account all the data available; there are no preset sample sizes that have to be met in order to perform the dependency analysis.

Formatting data manually

If you have continuous variables, you might want to discretize your data yourself beforehand to get a meaningful discretization. Often a particular discretization is meaningful because of theoretical assumptions or previous studies on the subject; such considerations cannot be inferred automatically.
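
For instance, if previous studies suggest particular cut points, you can apply them yourself before giving the data to B-Course. A minimal sketch using pandas (the variable, the cut points at 120 and 140, and the file name are all hypothetical):

  import pandas as pd

  # Hypothetical example: suppose earlier studies suggest cutting
  # systolic blood pressure at 120 and 140 mmHg.
  df = pd.DataFrame({"systolic": [112, 135, 150, 128, 141, 118]})
  df["systolic_cat"] = pd.cut(df["systolic"],
                              bins=[0, 120, 140, 300],
                              labels=["normal", "elevated", "high"])

  # Save the categorized data; the file name is invented here.
  df.to_csv("discretized.csv", index=False)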

Does the discretization have an effect on the results?

It does. If you discretize your continuous variables into just a few intervals, you are likely to find more dependencies than if you discretize them into very many intervals. The results may also change if you change the cut points that define your discretization.
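
This effect is easy to see in a simulation. The sketch below is not B-Course's method: it applies an ordinary chi-square test to simulated data purely to show that the same dependency yields different evidence depending on how finely the variables are discretized:

  import numpy as np
  from scipy.stats import chi2_contingency

  rng = np.random.default_rng(1)
  x = rng.normal(size=300)
  y = x + rng.normal(scale=2.0, size=300)   # a moderate linear dependency

  def dependency_p_value(k):
      # Equal-frequency discretization of both variables into k intervals.
      cuts = np.linspace(0, 1, k + 1)[1:-1]
      xc = np.digitize(x, np.quantile(x, cuts))
      yc = np.digitize(y, np.quantile(y, cuts))
      table = np.zeros((k, k))
      for i, j in zip(xc, yc):
          table[i, j] += 1
      return chi2_contingency(table)[1]

  # The same data and the same dependency; only the number of intervals
  # changes, and with it the p-value.
  print(dependency_p_value(2), dependency_p_value(12))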

So how can I find out if the discretization is good?

A very natural question, but we do not know how. The problem is that it is not clear what "goodness of discretization" actually means. One attempt to answer the question in the context of dependency modeling is that a good discretization is one that leads to the detection of true dependencies and true independencies between variables. This sounds good, but finding such good discretizations is beyond our current understanding. B-Course takes a simple approach and discretizes into very few intervals. This way you should not need very much data to find dependencies.

The canonical Bayesian answer could be that a good discretization is one that is very probable. However, it is not clear that a discretization can be treated as something unknown that nevertheless exists; rather, it is an artifact that has been created, so the probability of a discretization is not necessarily a meaningful concept. Theoretically, discretization cannot create new dependencies, only destroy them. In this sense, it might be meaningful to search for discretizations that reveal as many dependencies as possible. At the moment B-Course does not attempt this sophisticated approach.

 

B-Course, version 2.0.0
CoSCo 2002