This chapter contains the SAS lines for the methods applied in Section 5 of the companion chapter on classification. The SAS data set is denoted
section6 in what follows.
The scatter plot of the two variables is obtained using proc gplot.
proc gplot data=section6;
plot additive2*additive1=type;
run;
quit;
where the construction y*x=class means that points (x,y) belonging to different classes as determined by the variable
class are plotted using different plot symbols.
Quadratic discriminant analysis, as well as linear discriminant analysis and the nearest neighbour method, is carried out using the SAS procedure
proc discrim.
proc discrim data=section6 out=qdaout1 outstat=qdaout2
method=normal pool=no;
class type;
var additive1 additive2;
run;
The class membership is specified in the class line and the variables are specified in the var line. The options
method=normal and pool=no specify that the populations are normal distributions whose variance-covariance matrices are not
pooled; that is, each normal distribution has its own variance-covariance matrix. The SAS data set qdaout1 contains the observations and
their class membership according to the classification analysis, whereas the SAS data set qdaout2 contains the coefficients of the
discriminant functions.
The SAS output contains the two-by-two table showing how many observations were correctly/wrongly classified, and more relevant output is found in
the data sets qdaout1 and qdaout2. For example, the coefficients of the quadratic discriminant functions are retrieved from
qdaout2 using the following SAS lines.
data coefficients;
set qdaout2;
if _type_="QUAD";
run;
proc print data=coefficients;
run;
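The quadratic discriminant rule behind these coefficients is straightforward to reproduce outside SAS. The following Python sketch (numpy only, a hypothetical two-class data set with equal priors, not the section6 data) estimates per-class means and unpooled covariance matrices and classifies a point by the larger Gaussian discriminant score:

```python
import numpy as np

def qda_fit(X, y):
    """Estimate per-class means and (unpooled) covariance matrices."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return params

def qda_predict(params, x):
    """Assign x to the class with the largest Gaussian discriminant
    score (equal priors assumed)."""
    best, best_score = None, -np.inf
    for c, (mu, Sigma) in params.items():
        diff = x - mu
        score = (-0.5 * np.log(np.linalg.det(Sigma))
                 - 0.5 * diff @ np.linalg.solve(Sigma, diff))
        if score > best_score:
            best, best_score = c, score
    return best

# Two well-separated synthetic classes (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
params = qda_fit(X, y)
```

Because the covariance matrices are not pooled, the resulting decision boundary is quadratic rather than linear, matching pool=no above.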
To obtain the sample variance-covariance matrices for the two populations, use the option wcov. To test whether or not the matrices are
identical (that is, to test the hypothesis that the two variance-covariance matrices are equal), proc discrim can be used with pool=test.
proc discrim data=section6 method=normal pool=test wcov;
class type;
var additive1 additive2;
run;
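The pool=test option carries out a Bartlett/Box-type test of equal covariance matrices. As an illustration of the underlying statistic, here is a rough Python sketch of Box's M for two groups (synthetic data; this mirrors the idea but is not guaranteed to reproduce SAS's exact small-sample correction):

```python
import numpy as np

def box_m(groups):
    """Box's M statistic for equality of covariance matrices.
    groups: list of (n_i x p) arrays. Returns (M, df); under the null
    hypothesis, M (after a small-sample correction not shown here) is
    approximately chi-square distributed with df degrees of freedom."""
    p = groups[0].shape[1]
    ns = [g.shape[0] for g in groups]
    covs = [np.cov(g, rowvar=False) for g in groups]
    N, k = sum(ns), len(groups)
    # Pooled variance-covariance matrix.
    Sp = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - k)
    M = (N - k) * np.log(np.linalg.det(Sp)) - sum(
        (n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    df = p * (p + 1) * (k - 1) // 2
    return M, df

# Two synthetic groups drawn from the same distribution.
rng = np.random.default_rng(1)
g1 = rng.normal(size=(40, 2))
g2 = rng.normal(size=(40, 2))
M, df = box_m([g1, g2])
```

A large M relative to the chi-square(df) distribution argues against pooling, i.e. in favour of QDA over LDA.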
The linear discriminant analysis is specified as follows.
proc discrim data=section6 out=ldaout1 outstat=ldaout2
method=normal pool=yes;
class type;
var additive1 additive2;
run;
The only difference from the QDA specification is the option pool=yes. Adding the option crosslisterr produces a list of the
misclassified observations. The coefficients of the linear discriminant functions are obtained using the following SAS lines.
data coefficients;
set ldaout2;
if _type_="LINEAR";
run;
proc print data=coefficients;
run;
Using the coefficients obtained from the QDA and LDA, the decision boundaries can be calculated and plotted (this only works in situations where
there are two variables). The SAS data set decision1 contains the calculations of the decision boundary for QDA.
data decision1;
do additive1=0 to 7.3 by 0.1;
quadcoef=0.0345;
lincoef=2*0.0506*additive1+1.61;
conscoef=0.0246*additive1*additive1+2.6*additive1-20.68;
additive2=(-lincoef+sqrt(lincoef*lincoef-4*quadcoef*conscoef))
/(2*quadcoef);
type=3;
output;
end;
run;
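The data step above solves, for each fixed value of additive1, a quadratic equation in additive2 and keeps one root via the standard quadratic formula. The root computation itself can be sketched in Python (the coefficients below are illustrative, not the fitted values):

```python
import math

def boundary_root(a, b, c):
    """Larger root of a*y^2 + b*y + c = 0; the QDA decision boundary
    solves such a quadratic in additive2 for each fixed additive1."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return None  # no real boundary point at this x value
    return (-b + math.sqrt(disc)) / (2 * a)

# Example: y^2 - 3y + 2 = 0 has roots 1 and 2; the larger root is 2.
root = boundary_root(1, -3, 2)
```

As noted below, the root is sensitive to rounding of the coefficients, which is why more decimals are carried than are printed in the companion chapter.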
The data set decision2 contains the calculations of the decision boundary for LDA.
data decision2;
do additive1=0 to 8.3 by 0.1;
additive2=(2.29*additive1+19.33)/1.49;
type=4;
output;
end;
run;
The two data sets and the original data set are combined into a single data set in the next SAS lines.
data decision3;
set section6 decision1 decision2;
run;
Pay attention to the names of the variables constructed: they match the names of the variables in the original data set. Moreover, more decimals
than given in the companion chapter on classification are used in the calculation of the decision boundary for QDA, as the resulting boundary is
sensitive to rounding.
Again proc gplot is used to obtain the plot of the original observations with decision boundaries added.
symbol1 v=circle r=1;
symbol2 v=triangle r=1;
symbol3 v=none i=join r=1;
symbol4 v=none i=join r=1 line=2;
proc gplot data=decision3;
plot additive2*additive1=type;
run;
quit;
A different plot symbol is used for each class (type of food) and each decision boundary.
In order to do logistic regression, the class levels are recoded to 0 and 1, respectively. This allows the SAS procedure proc genmod to be
used directly.
data section6b;
set section6;
if type=1 then newtype=0; else newtype=1;
run;
The logistic regression can be specified as follows.
proc genmod data=section6b;
model newtype=additive1 additive2 /dist=bin type3;
output out=out1 predicted=posterior;
run;
The option type3 makes the SAS output contain p-values for the terms in the model. The SAS lines result in the creation of a data set
out1 containing the posterior probabilities (predicted=posterior). The class membership of an observation is determined by
whether the posterior probability is smaller or larger than 0.5: a posterior probability below 0.5 implies membership of type 1, and a
posterior probability above 0.5 implies membership of type 2.
data out2;
set out1;
if posterior<0.5 then class=1; else class=2;
run;
The two-by-two table showing the numbers of correctly and wrongly classified observations is obtained using proc freq.
proc freq data=out2;
tables class*type;
run;
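The same thresholding and cross-tabulation are easy to reproduce outside SAS. A small Python sketch with made-up posterior probabilities and class labels (not the section6 data):

```python
from collections import Counter

# Hypothetical posterior probabilities and true class labels (1 or 2).
posterior = [0.1, 0.3, 0.8, 0.6, 0.45, 0.9]
true_type = [1, 1, 2, 2, 2, 2]

# Classify as type 1 below 0.5, type 2 above.
predicted = [1 if p < 0.5 else 2 for p in posterior]

# Cross-tabulate (predicted, true) pairs, as in proc freq's
# tables class*type statement.
table = Counter(zip(predicted, true_type))
```

Each key of `table` is a (predicted class, true class) pair; off-diagonal entries count the misclassified observations.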
Nearest neighbour classification with k = 3 is obtained as follows.
proc discrim data=section6 out=out1 method=npar k=3;
class type;
var additive1 additive2;
run;
The nearest neighbour method is specified by means of the options method=npar and k=3, where the second option determines the
number of neighbours to be included in the majority vote on class membership.
The SAS output gives a table of the numbers of correctly and wrongly classified observations. The data set out1 contains the nearest
neighbour classification of the observations.
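The nearest-neighbour rule itself is simple enough to state in a few lines of code. A Python sketch of the majority vote among the k nearest training points (synthetic data, Euclidean distance assumed):

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, x, k=3):
    """Classify point x by a majority vote among its k nearest
    training points, using Euclidean distance."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Tiny synthetic training set with two classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([1, 1, 1, 2, 2, 2])
```

With k = 3 and two classes the vote can never be tied, which is one practical reason for choosing an odd k.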
An important option not mentioned above is crossvalidate, used for example as follows.
proc discrim data=section6 out=qdaout1 outstat=qdaout2
method=normal pool=no Crossvalidate;
class type;
var additive1 additive2;
run;
This will provide the full leave-one-out cross-validation classification results. It works for all the methods directly
available in proc discrim and could hence be used as a model selection criterion (including determining the number of
neighbours in the k-nearest-neighbour method), similar to the way cross-validation is used in the regression module.
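To make the model-selection idea concrete, here is a hedged Python sketch of leave-one-out cross-validation used to choose k for the nearest-neighbour classifier (synthetic data and a hypothetical candidate grid; the principle, not SAS's output, is what is illustrated):

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, x, k):
    """k-nearest-neighbour majority vote, Euclidean distance."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

def loo_error(X, y, k):
    """Leave-one-out cross-validation error rate for k-NN."""
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # drop observation i
        pred = knn_classify(X[mask], y[mask], X[i], k)
        errors += pred != y[i]
    return errors / len(y)

# Synthetic two-class data set.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([1] * 30 + [2] * 30)

# Choose k with the smallest leave-one-out error over a candidate grid.
best_k = min([1, 3, 5, 7], key=lambda k: loo_error(X, y, k))
```

Each observation is classified from a training set that excludes it, so the error rate is an honest estimate of out-of-sample performance, which is what makes it usable for choosing k.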
