5: SAS Programs
Some Basic Statistical Procedures
This tutorial covers several basic statistical procedures, including generating frequency distributions, performing t-tests, analysis of variance, and correlational analysis. The relevant procedures here are PROC FREQ, PROC TTEST, PROC GLM and PROC CORR. In all of the examples below, we will use the nations data set outlined in Tutorial 2.
Frequency Distributions and Crosstabulations
The SAS procedure PROC FREQ is used to generate frequency distributions and crosstabulations. Its options also allow for the estimation of several commonly used statistics to test the statistical independence of categorical variables or to assess their degree of association. To start, let us generate a simple frequency distribution for some categorical variables, in this case religion, gdp and area.

    LIBNAME place 'A:\';
    PROC FREQ data=place.nations;
      TABLES religion gdp area;
    RUN;

The TABLES statement in PROC FREQ can be extended to generate a two-way (or higher) table and to request a test of statistical independence. In this example, we examine the distribution of gdp by religion and by area and perform a chi-squared test to determine whether the variables in each pair are independent of one another. The EXPECTED option adds the cell counts expected under independence.

    LIBNAME place 'A:\';
    PROC FREQ data=place.nations;
      TABLES religion*gdp area*gdp / CHISQ EXPECTED;
    RUN;
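The options for assessing degree of association mentioned above can be requested in the same way. The following is a minimal sketch, assuming the same place.nations data set: MEASURES asks for measures of association alongside the chi-squared test, and NOROW, NOCOL and NOPERCENT are included only to keep the table compact. Which of the reported measures is meaningful depends on whether the variables are nominal or ordinal.

```sas
LIBNAME place 'A:\';

/* Two-way table of religion by gdp with a chi-squared test (CHISQ)
   and measures of association (MEASURES). The NOROW, NOCOL and
   NOPERCENT options suppress the row, column and overall percentages. */
PROC FREQ data=place.nations;
  TABLES religion*gdp / CHISQ MEASURES NOROW NOCOL NOPERCENT;
RUN;
```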
We can use a BY statement in PROC MEANS or PROC UNIVARIATE to look at the means and variances across two or more categories. We probably want to go beyond the descriptive level, however, and attempt to determine whether any differences across the groups are statistically significant. In the case of a classifying variable with two groups, the classical t-test is probably the most common approach to the problem. In the nations data, we might want to see whether there is a significant difference in level of urbanization by religion.

    LIBNAME place 'A:\';
    PROC TTEST data=place.nations;
      class religion;
      var pcturban;
    RUN;

The above code produces the output needed to assess whether the two predominant religions differ in level of urbanization. Results are provided for both equal and unequal variance t-tests. The printout also includes the results of an F-test of whether the two variances are equal. When the two variances can be treated as equal (as is the case here), the pooled t-test results are the most appropriate to consider.
PROC TTEST can also be used for paired comparisons and for conducting one-sample tests. To determine whether the mean level of urbanization is equal to 60 percent, we could use the following code.

    LIBNAME place 'A:\';
    PROC TTEST data=place.nations H0=60;
      var pcturban;
    RUN;

Notice that the comparison population value is placed on the PROC TTEST statement in the form H0=60 (H zero equals sixty). In this case, the results are significant at alpha=.05 but not at .01.
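Before running the two-sample test, it can help to look at the group means and variances with the BY statement mentioned earlier. The sketch below assumes the same place.nations data set; nsorted is a hypothetical name for a temporary sorted copy, which is needed because BY-group processing expects the data to be sorted by the BY variable.

```sas
LIBNAME place 'A:\';

/* Sort a temporary copy by the grouping variable.
   nsorted is a hypothetical data set name used only in this sketch. */
PROC SORT data=place.nations out=nsorted;
  BY religion;
RUN;

/* Sample size, mean, variance and standard deviation of pcturban
   for each value of religion. */
PROC MEANS data=nsorted N MEAN VAR STD;
  BY religion;
  VAR pcturban;
RUN;
```

Replacing the BY statement with a CLASS religion; statement in PROC MEANS gives the same grouping without the sort.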
Analysis of Variance
When the classifying variable has more than two categories, the t-test is inappropriate. In this instance, we would use an analysis of variance to see whether there are overall differences across the categories. For balanced designs, use PROC ANOVA. More generally, however, you will find PROC GLM more useful since it can handle both balanced and unbalanced designs. It can also be used to include quantitative covariates in the analysis (ANCOVA). Try the following code to produce a simple one-way ANOVA.

    LIBNAME place 'A:\';
    PROC GLM data=place.nations;
      class area;
      model pcturban=area;
    RUN;

PROC GLM can also produce two-way or higher ANOVAs simply by altering the model statement. In the following example, we include two classifying or independent variables, URBAN and RELIGION. Because of the limited number of cases, the dependent variable will be changed to birth rate (BIRTHRTE). We will also test for possible interactions between these terms by including an urban*religion term.

    LIBNAME place 'A:\';
    PROC GLM data=place.nations;
      class urban religion;
      model birthrte=urban religion urban*religion;
      means urban*religion;
    RUN;

Note that the interaction term is not statistically significant but that the main effects are by most standard criteria. To help in the interpretation of the analysis, a 'means' statement has also been included. This results in the mean birth rate being estimated for each combination of urbanity and religion. Notice, however, that one of the cells has only one observation.
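Sparse cells like the one noted above are easier to spot before fitting the model. As a rough sketch, assuming the same urban and religion variables, a PROC FREQ crosstabulation restricted to counts shows how many observations fall in each combination.

```sas
LIBNAME place 'A:\';

/* Cell counts for each urban-by-religion combination.
   The percentages are suppressed so only the frequencies are shown. */
PROC FREQ data=place.nations;
  TABLES urban*religion / NOROW NOCOL NOPERCENT;
RUN;
```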
PROC GLM is also capable of solving general linear model or analysis of covariance (ANCOVA) problems. In this instance, both quantitative and qualitative variables may be included in the model. Consider the following example where birth rate is defined as a function of literacy and religion.

    LIBNAME place 'A:\';
    PROC GLM data=place.nations;
      class religion;
      model birthrte=literacy religion / solution;
    RUN;

Notice the /solution option at the end of the model statement. This option generates estimates of the model parameters (b-values) along with their estimated standard errors and t-values. Without the /solution option, only the ANOVA table is printed.
If all of the variables in your model are quantitative or continuous, then you might want to use PROC REG. PROC REG (regression) is far more efficient than PROC GLM for such models and offers several different options (see below).
Another useful statistic is the Pearson Product-Moment Correlation Coefficient, otherwise known as Pearson's r. Single or multiple correlations can be generated using PROC CORR. This procedure will also output some basic descriptive information along with the p-value for each correlation. When several correlations are produced, however, take care not to 'fish around' for those that are statistically significant. Remember that at the .05 level of significance, we would expect 5 out of every 100 correlations (1 in 20) to be statistically significant by chance alone.

    LIBNAME place 'A:\';
    PROC CORR data=place.nations outp=nsort;
      var birthrte pcturban literacy;
    RUN;

Further information on the use of any of these procedures can be obtained by accessing the help screens. This is a particularly useful way of determining what options are available and how one should go about invoking them.
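One way to avoid producing more correlations than the analysis calls for is to ask only for the pairs of interest. The following is a minimal sketch, assuming the variable names used in the examples above: the WITH statement limits the output to correlations between the WITH variable and each VAR variable, rather than the full matrix.

```sas
LIBNAME place 'A:\';

/* Correlate birth rate with each of the two predictors only,
   instead of printing every pairwise correlation. */
PROC CORR data=place.nations;
  VAR pcturban literacy;
  WITH birthrte;
RUN;
```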
The SAS procedure PROC REG is used to perform both simple and multiple regression. PROC REG assumes that all of the variables in the model are quantitative or continuous. If they are not, you might want to consider running PROC GLM (see above). In the following example, we are regressing total fertility rate (tfr) on the percent of the population considered literate and the percent of the population living in urban areas.

    LIBNAME place 'A:\';
    PROC REG data=place.nations;
      MODEL tfr=literacy urban;
    RUN;

PROC REG will produce a summary ANOVA table for the entire model along with estimates of the b-values, their standard errors and t-values. If you require standardized b-values, put a /stb at the end of the model statement (i.e., MODEL tfr=literacy urban /stb;). A /p option will result in the observed and predicted y-values being printed as part of the output. You might want to be cautious with this option if you have a large data set since several (hundred?) pages of output can be produced.
An output statement can also be used to save the regression results. Check the help menu using the index word reg for details.
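As an illustration of the output statement, the sketch below saves predicted values and residuals from the regression above to a temporary data set. The names regout, pred and resid are hypothetical choices for this example, not names used elsewhere in the tutorial.

```sas
LIBNAME place 'A:\';

/* Fit the regression and save predicted values (P=) and residuals (R=).
   regout, pred and resid are hypothetical names chosen for this sketch. */
PROC REG data=place.nations;
  MODEL tfr=literacy urban;
  OUTPUT out=regout p=pred r=resid;
RUN;
QUIT;

/* Print the first ten saved observations to check the result. */
PROC PRINT data=regout (obs=10);
  VAR tfr literacy urban pred resid;
RUN;
```

Saving the values to a data set avoids the long printed listing that the /p option can produce on a large data set.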