 Module 5: Computing
By Cristian Ritz and Per Bruun Brockhoff

This chapter contains SAS lines for the methods applied in Section 5 in the companion chapter on classification. The SAS data set is denoted section6 in what follows.

The scatter plot of the two variables is obtained using proc gplot.

```proc gplot data=section6;
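plot y*x=class; /* y, x and class are placeholders for the two measured variables and the class variable */
run;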
quit;
```

where the construction y*x=class means that points (x,y) belonging to different classes as determined by the variable class are plotted using different plot symbols.

# 5.1 Linear and quadratic discriminant analysis

Quadratic discriminant analysis, as well as linear discriminant analysis and the nearest neighbour method, is carried out using the SAS procedure proc discrim.

```proc discrim data=section6 out=qdaout1 outstat=qdaout2
method=normal pool=no;
class type;
run;
```

The class membership is specified in the class line and the variables are specified in the var line; if the var line is omitted, proc discrim uses all numeric variables not named in other statements. The options method=normal and pool=no specify that the populations are normal distributions whose variance-covariance matrices are not pooled, that is, each normal distribution has its own variance-covariance matrix. The SAS data set qdaout1 contains the observations and their class membership according to the classification analysis, whereas the SAS data set qdaout2 contains the coefficients of the discriminant functions.

The SAS output contains the two-by-two table showing how many observations were correctly or wrongly classified, and more relevant output is found in the data sets qdaout1 and qdaout2. For example, the coefficients of the quadratic discriminant functions are retrieved from qdaout2 using the following SAS lines.

```data coefficients;
set qdaout2;
if _type_="QUAD"; /* keep only the rows holding the quadratic discriminant coefficients */
run;
proc print data=coefficients;
run;
```

To obtain the sample variance-covariance matrices for the two populations, use the option wcov. To test whether or not the matrices are identical (that is, to test the hypothesis $\Sigma_1=\Sigma_2$), proc discrim can be used with pool=test.

```proc discrim data=section6 method=normal pool=test wcov;
class type;
run;
```

The linear discriminant analysis is specified as follows.

```proc discrim data=section6 out=ldaout1 outstat=ldaout2
method=normal pool=yes;
class type;
run;
```

The only difference from specifying QDA is the option pool=yes. Adding the option crosslisterr results in a list of the observations that are misclassified under cross validation. The coefficients of the linear discriminant functions are obtained using the following SAS lines.

```data coefficients;
set ldaout2;
if _type_="LINEAR";
run;
proc print data=coefficients;
run;
```
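As a minimal sketch of the crosslisterr option mentioned above, the LDA call can be repeated with the option added. The data set and class variable are those used throughout this chapter; the out= and outstat= options are left out here only to keep the example short.

```sas
/* Sketch: LDA as above, additionally listing the observations
   that are misclassified under cross validation (crosslisterr). */
proc discrim data=section6 method=normal pool=yes crosslisterr;
class type;
run;
```

The listing appears in the SAS output; no additional data set is created by this option.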

Using the coefficients obtained from the QDA and LDA, the decision boundaries can be calculated and plotted (this only works in situations where there are two variables). The SAS data set decision1 contains the calculations of the decision boundary for QDA.

```data decision1;
do additive1=0 to 7.3 by 0.1;
type=3;
output;
end;
run;
```

The data set decision2 contains the calculations of the decision boundary for LDA.

```data decision2;
do additive1=0 to 8.3 by 0.1;
type=4;
output;
end;
run;
```

The two data sets and the original data set are combined into a single data set in the next SAS lines.

```data decision3;
set section6 decision1 decision2;
run;
```

Pay attention to the names of the variables constructed: they match the names of the variables in the original data set. Moreover, more decimals than given in the companion chapter on classification are used in the calculation of the decision boundary of QDA, as the resulting boundary is sensitive to rounding.
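For orientation, the boundaries computed in decision1 and decision2 can be written in the standard textbook form below. The notation is an assumption and may differ from the companion chapter: $x$ is the vector of the two variables, $\mu_j$ and $\Sigma_j$ are the mean and variance-covariance matrix of class $j$, $\Sigma$ is the pooled matrix, and $\pi_j$ is the prior probability of class $j$. SAS may scale its reported coefficients differently.

```latex
\begin{align*}
% QDA: discriminant score of class j
d_j(x) &= -\tfrac{1}{2}\log|\Sigma_j|
          - \tfrac{1}{2}(x-\mu_j)^{\top}\Sigma_j^{-1}(x-\mu_j)
          + \log\pi_j, \qquad j = 1, 2, \\
% QDA boundary: points where the two scores are equal
d_1(x) &= d_2(x), \\
% LDA boundary: with a pooled matrix the quadratic terms cancel
(\mu_1-\mu_2)^{\top}\Sigma^{-1}x
       &= \tfrac{1}{2}\bigl(\mu_1^{\top}\Sigma^{-1}\mu_1
          - \mu_2^{\top}\Sigma^{-1}\mu_2\bigr)
          - \log\frac{\pi_1}{\pi_2}.
\end{align*}
```

For each value of the first variable in the do loops, the boundary value of the second variable is the solution of the corresponding equation: a linear equation for LDA and a quadratic equation for QDA, which is why rounding of the coefficients matters more for QDA.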

Again proc gplot is used to obtain the plot of the original observations with decision boundaries added.

```symbol1 v=circle r=1;
symbol2 v=triangle r=1;
symbol3 v=none i=join r=1;
symbol4 v=none i=join r=1 line=2;

proc gplot data=decision3;
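plot y*additive1=type; /* y is a placeholder for the second measured variable, whose name is not shown here */
run;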
quit;
```

A different plot symbol is used for each class (type of food) and each decision boundary.

# 5.2 Logistic regression and nearest neighbour

In order to do logistic regression the class levels are re-named to 0 and 1, respectively. This allows the SAS procedure proc genmod to be used directly.

```data section6b;
set section6;
if type=1 then newtype=0; else newtype=1;
run;
```

The logistic regression can be specified as follows.

```proc genmod data=section6b;
output out=out1 predicted=posterior;
run;
```

The option type3, given in the model line, makes the SAS output contain p-values for the terms in the model. The SAS lines create a data set out1 containing the posterior probabilities in the variable posterior (predicted=posterior). The class membership of an observation is determined by whether the posterior probability is below or above 0.5: a posterior probability below 0.5 implies membership of type 1, and a posterior probability above 0.5 implies membership of type 2.

```data out2;
set out1;
if posterior<0.5 then class=1; else class=2;
run;
```
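The 0.5 threshold used above can be related to the fitted model as follows. This is a sketch under assumptions not stated in the chapter: a logit link, a linear predictor in the two variables $x_1$ and $x_2$ with coefficients $\beta_0, \beta_1, \beta_2$, and a posterior that is the probability of newtype=1 (type 2), which is what the classification rule in the text implies.

```latex
\begin{align*}
% Posterior probability of type 2 (newtype = 1) under the logistic model
p(x) &= P(\text{newtype}=1 \mid x)
      = \frac{\exp(\eta(x))}{1+\exp(\eta(x))},
      \qquad \eta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \\
% The 0.5 rule is a rule on the sign of the linear predictor
p(x) &> 0.5 \iff \eta(x) > 0 .
\end{align*}
```

Under these assumptions the decision boundary of the logistic regression is the straight line $\eta(x)=0$, so it is linear in the two variables, like the LDA boundary.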

The two-by-two table showing the numbers of correctly and wrongly classified observations is obtained using proc freq.

```proc freq data=out2;
tables class*type;
run;
```

Nearest neighbour classification with k=3 is obtained as follows.

```proc discrim data=section6 out=out1 method=npar k=3;
class type;
run;
```

The nearest neighbour method is specified by means of the options method=npar and k=3, where the second option determines the number of neighbours to be included in the majority vote on class membership.

The SAS output gives a table of the numbers of correctly and wrongly classified observations. The data set out1 contains the nearest neighbour classification of the observations.
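The table can also be rebuilt from the out= data set, which may be convenient for further processing. The sketch below assumes only that out1 was created by the proc discrim call above; the class assigned by proc discrim is stored in the variable _INTO_ of that data set.

```sas
/* Sketch: cross-tabulate the true class (type) against the class
   assigned by proc discrim (_INTO_) in the out= data set. */
proc freq data=out1;
tables type*_into_;
run;
```

The same lines can be used with qdaout1 or ldaout1 to compare the methods on equal terms.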

# 5.3 Cross Validation

An important option not mentioned above is crossvalidate, which requests cross validation of the classification. It is used e.g. as follows:

```proc discrim data=section6 out=qdaout1 outstat=qdaout2
method=normal pool=no Crossvalidate;
class type;