
Specification of the required data format

As in many statistical applications, B-Course expects the data to be in one big data table (sometimes called a matrix).

B-Course expects a tab-delimited ASCII text file

B-Course does not attempt to support different data formats, which is why we try to be very explicit about the format it accepts. The data must be in a tab-delimited ASCII text file whose first line is a header containing the variable names. Each line after the header contains one case (also called a data vector, an observation, or a unit). Missing values are denoted by blanks.

Gender        Weight      Fav team
boy           53          Lakers
boy                       76
girl          56          76
girl          45          Lakers
boy           51.5        Lakers
An example of data in the correct format
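
As a rough illustration, the example table could be produced as tab-delimited text with a few lines of Python. This is only a sketch: the names `header`, `rows` and `to_tab_delimited` are made up for this example and are not part of B-Course, and `None` is used here merely as a convenient way to mark a missing value before it is written out as a blank.

```python
# The example table; None marks a missing value (an assumption of this sketch).
header = ["Gender", "Weight", "Fav team"]
rows = [
    ["boy", 53, "Lakers"],
    ["boy", None, "76"],
    ["girl", 56, "76"],
    ["girl", 45, "Lakers"],
    ["boy", 51.5, "Lakers"],
]


def to_tab_delimited(header, rows):
    """Return the table as text: one line per case, fields separated by tabs."""
    lines = ["\t".join(header)]
    for row in rows:
        # A missing value becomes the empty string between two tabs.
        lines.append("\t".join("" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


text = to_tab_delimited(header, rows)

# Encoding as ASCII raises UnicodeEncodeError if a non-ASCII character slipped in.
data = text.encode("ascii")
```

The resulting `data` can then be written to a file opened in binary mode, which keeps the line endings exactly as generated.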

What is an ASCII text file (and what is not)?

Trying hard to be non-technical: an ASCII text file is something that does not look like garbage when viewed in a simple text editor. In our case the data file is supposed to be organized so that it has rows and columns. To get nice pictures and documents, you are encouraged to use only letters of the English alphabet in variable names and value names.

For example, Microsoft Word documents, Excel files, and SPSS files are not text files (but these programs support saving in text format).

What is a row in a text file?

Most probably you do not have to care about this issue at all. The reason for addressing such a simple question is that different systems (MS Windows, Mac OS, Linux, etc.) mark the end of a line in a text file in different ways. B-Course accepts three kinds of end-of-line markers: the Mac OS way, a carriage return (ASCII code 13); the DOS (and MS Windows) way, a carriage return followed by a newline (ASCII codes 13 and 10); and the Unix way, a newline (ASCII code 10).
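
To make the three conventions concrete, here is a minimal Python sketch that splits text into rows on any of them. The function name `split_rows` is hypothetical, and how B-Course itself treats files that mix conventions or contain blank lines is not stated here, so this only illustrates the idea.

```python
import re

# CR LF is listed first so that it counts as one line end, not two.
_END_OF_LINE = re.compile(r"\r\n|\r|\n")


def split_rows(text):
    """Split text into rows on CR LF (DOS), CR (Mac OS) or LF (Unix)."""
    rows = _END_OF_LINE.split(text)
    # A final end-of-line marker leaves one empty string at the end; drop it.
    if rows and rows[-1] == "":
        rows.pop()
    return rows
```

With this, `split_rows("a\tb\r\n1\t2\r\n")`, `split_rows("a\tb\r1\t2\r")` and `split_rows("a\tb\n1\t2\n")` all give the same two rows.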

Header line

The first line of the data file should contain the names of the variables. A name can be any sequence of characters (letters of the English alphabet preferred) that does not contain a tabulator (ASCII code 9), because tabulators are used to separate the variable names from each other. There is no limit on the length of the names, but reasonably short names make the output of B-Course look nicer. You cannot have two identical variable names. For obvious reasons, names containing only whitespace (such as blanks) are discouraged.
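
The header rules can be checked before uploading with a short Python sketch like the one below. The function `check_header` and its messages are invented for this illustration. Note that the text above only discourages whitespace-only names, so reporting them as a problem is a stricter choice made for this sketch.

```python
def check_header(header_line):
    """Split a header line on tabs and report questionable variable names.

    Returns (names, problems), where problems is a list of messages.
    """
    names = header_line.split("\t")
    problems = []
    seen = set()
    for position, name in enumerate(names, start=1):
        # Discouraged rather than forbidden; flagged here as an assumption.
        if name.strip() == "":
            problems.append(f"column {position}: name is empty or only whitespace")
        # Two identical variable names are not allowed.
        if name in seen:
            problems.append(f"column {position}: duplicate name {name!r}")
        seen.add(name)
    return names, problems
```

For the example header, `check_header("Gender\tWeight\tFav team")` returns the three names and an empty list of problems.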

Data rows

The lines after the header line (that is, from line two on) should contain the data. Each line should contain exactly one data vector (that is, the values of all the variables for a single case). The values should be separated by tabulators.
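
One simple way to check this requirement is to confirm that every data line has as many tab-separated fields as the header. The Python sketch below does that. The function name `check_rows` is hypothetical, and what B-Course actually does with a line that has too few or too many fields is not described here.

```python
def check_rows(lines):
    """Compare the field count of each data line with that of the header.

    lines: list of text rows, header first.
    Returns (expected, bad), where bad lists (line_number, field_count)
    for every data line whose field count differs from the header's.
    """
    expected = len(lines[0].split("\t"))
    bad = []
    # Data starts on line two of the file.
    for line_number, line in enumerate(lines[1:], start=2):
        count = len(line.split("\t"))
        if count != expected:
            bad.append((line_number, count))
    return expected, bad
```

A line with a missing value still has the full number of fields, because the blank sits between two tabulators; only a line with a missing or extra tabulator is reported.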

How are the values coded?

Values can be strings or numbers. Numbers are real numbers. The decimal point must be a period (.). Negative numbers are denoted by a preceding "-". Scientific notation (like 4e-12) is not supported. All values containing only whitespace are interpreted as missing values. The empty string (the "nothingness" between two consecutive tabulators in a file) also denotes a missing value. Variables containing nothing but missing values are not accepted.
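
The coding rules might be approximated in Python as follows. This is a sketch under assumptions, not B-Course's actual parser: the name `classify` and the regular expression are invented here, and details the text does not settle (for example whether "5." or ".5" counts as a number, or whether spaces around a number are ignored) are decided arbitrarily.

```python
import re

# Optional leading "-", digits, and at most one period as the decimal point.
# No exponent part, so scientific notation such as 4e-12 does not match.
_NUMBER = re.compile(r"-?(\d+(\.\d*)?|\.\d+)")


def classify(value):
    """Classify one field as ("missing", None), ("number", x) or ("string", s)."""
    # Empty or whitespace-only fields are missing values.
    if value.strip() == "":
        return ("missing", None)
    # Assumption: a number must match exactly, with no surrounding spaces.
    if _NUMBER.fullmatch(value):
        return ("number", float(value))
    return ("string", value)
```

Under these assumptions, "51.5" and "-3" are read as numbers, "Lakers" as a string, and "4e-12" falls through to a string because scientific notation is not recognized as a number.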

 

  B-Course, version 2.0.0
CoSCo 2002