Introduction

Frequency Distributions and Crosstabulations

T-tests

Analysis of Variance

Correlations

Regression

**Introduction**

This tutorial covers several basic statistical procedures, including generating frequency distributions, performing t-tests, analysis of variance and correlational analysis. The relevant procedures here are PROC FREQ, PROC TTEST, PROC GLM and PROC CORR. In all of the examples below, we will use the nations data set outlined in Tutorial 2.

**Frequency Distributions and Crosstabulations**

The SAS procedure PROC FREQ is used to generate frequency distributions and crosstabulations. Its options also allow for the estimation of several commonly used statistics to test the statistical independence of categorical variables or to assess their degree of association. To start, let us generate a simple frequency distribution for some categorical variables, in this case religion, gdp and area.

    LIBNAME place 'A:\';
    PROC FREQ data=place.nations;
      TABLES religion gdp area;
    RUN;

The TABLES statement in PROC FREQ can be extended to generate a two (or more) way table and to generate a test of statistical independence. In this example, we examine the distribution of gdp by religion and by area and perform a chi-squared test to determine whether the corresponding variables are independent of one another.

    LIBNAME place 'A:\';
    PROC FREQ data=place.nations;
      TABLES religion*gdp area*gdp / CHISQ EXPECTED;
    RUN;

**T-tests**

We can use a BY statement in PROC MEANS or PROC UNIVARIATE to look at the means and variances across two or more categories. We probably want to go beyond the descriptive level, however, and attempt to determine whether any differences across the groups are statistically significant. In the case of a classifying variable with two groups, the classical t-test is probably the most common approach to the problem. In the nations data, we might want to see whether there is a significant difference in level of urbanization by religion.

    LIBNAME place 'A:\';
    PROC TTEST data=place.nations;
      class religion;
      var pcturban;
    RUN;

The above code produces the output needed to assess whether the two predominant religions differ in urbanization. Results are provided for both equal and unequal variance t-tests. The printout also includes the results of an F-test of whether the two variances are equal. When the two variances are equal (as is the case here), the pooled t-test is the most appropriate one to consider.
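The descriptive comparison mentioned at the start of this section can be sketched as follows. This is only an illustration: it assumes the same religion and pcturban variables used in the t-test example, and the sorted data set name nsorted is made up for the sketch.

```sas
LIBNAME place 'A:\';

/* BY-group processing requires the data to be sorted by the BY variable. */
/* The output data set name "nsorted" is an assumption for this sketch.   */
PROC SORT data=place.nations out=nsorted;
  BY religion;
RUN;

/* Report the sample size, mean and variance of pcturban within each religion. */
PROC MEANS data=nsorted n mean var;
  BY religion;
  VAR pcturban;
RUN;
```

Sorting into a separate data set leaves place.nations itself untouched.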

PROC TTEST can also be used for paired comparisons and for conducting one-sample tests. To determine whether the mean level of urbanization is equal to 60 percent, we could use the following code.

    LIBNAME place 'A:\';
    PROC TTEST data=place.nations H0=60;
      var pcturban;
    RUN;

Notice that the comparison population value is placed on the PROC TTEST statement in the form of H0=60 (H zero equals sixty). In this case, the results are significant at alpha=.05 but not at .01.
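A paired comparison might look something like the sketch below. The nations data set as described here has no obvious pair of repeated measurements, so the example builds a small hypothetical data set; the name paired, the variables lit1980 and lit1990, and all of the values are invented purely for illustration.

```sas
/* Hypothetical data: literacy rates for six countries in two census years. */
DATA paired;
  INPUT country $ lit1980 lit1990;
  DATALINES;
A 52 58
B 71 75
C 40 49
D 88 90
E 63 66
F 35 44
;
RUN;

/* Test whether the mean within-country difference (lit1980 - lit1990) is zero. */
PROC TTEST data=paired;
  PAIRED lit1980*lit1990;
RUN;
```

The PAIRED statement takes the place of the CLASS and VAR statements, since each observation supplies both measurements.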

**Analysis of Variance**

When the classifying variable has more than two categories, the t-test is inappropriate. In this instance, we would use an analysis of variance to see whether there are overall differences across the categories. For balanced designs, use PROC ANOVA. More generally, however, you will find PROC GLM more useful since it can handle both balanced and unbalanced designs. It can also be used to include quantitative covariates in the analysis (ANCOVA). Try the following code to produce a simple one-way anova.

    LIBNAME place 'A:\';
    PROC GLM data=place.nations;
      class area;
      model pcturban=area;
    RUN;

PROC GLM can also produce two or more way anovas simply by altering the model statement. In the following example, we include two classifying or independent variables: URBAN and RELIGION. Because of the limited number of cases, the dependent variable will be changed to birth rate (BIRTHRTE). We will also test for possible interactions between these terms by including an urban*religion term.

    LIBNAME place 'A:\';
    PROC GLM data=place.nations;
      class urban religion;
      model birthrte=urban religion urban*religion;
      means urban*religion;
    RUN;

Note that the interaction term is not statistically significant but that the main effects are significant by most standard criteria. To help in the interpretation of the analysis, a 'means' statement has also been included. This results in the mean birth rate being estimated for each combination of urbanity and religion. Notice, however, that one of the cells has only one observation.
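For comparison, the one-way model from this section could also be written with PROC ANOVA, roughly as sketched below. It assumes the same area and pcturban variables. A one-way layout is one of the cases PROC ANOVA can handle even when the group sizes differ; for multi-way models with unequal cell sizes, PROC GLM remains the safer choice.

```sas
LIBNAME place 'A:\';

/* One-way analysis of variance: does mean pcturban differ across areas? */
PROC ANOVA data=place.nations;
  CLASS area;
  MODEL pcturban=area;
RUN;
QUIT;
```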

PROC GLM is also capable of solving general linear model or analysis of covariance (ANCOVA) problems. In this instance, both quantitative and qualitative variables may be included in the model. Consider the following example where birth rate is defined as a function of literacy and religion.

    LIBNAME place 'A:\';
    PROC GLM data=place.nations;
      class religion;
      model birthrte=literacy religion / solution;
    RUN;

Notice the /solution option at the end of the model statement. This option generates estimates of the model parameters (b-values) along with their estimated standard errors and t-values. Without the /solution option, only the ANOVA table will be printed.

If all of the variables in your model are quantitative or continuous, then you might want to use PROC REG. PROC REG is far more efficient than PROC GLM and contains several different options (see below).

**Correlations**

Another useful statistic is the Pearson Product Moment Correlation Coefficient, otherwise known as Pearson's r. Single or multiple correlations can be generated using PROC CORR. This procedure will also output some basic descriptive information and report the p-value for each correlation. Caution should be used, however, when several correlations are produced, not to 'fish around' for those that are statistically significant. Remember that at the .05 level of significance, we would expect 5 out of every 100 correlations to be statistically significant just by chance alone.

    LIBNAME place 'A:\';
    PROC CORR data=place.nations outp=nsort;
      var birthrte pcturban literacy;
    RUN;

Further information on the use of any of these procedures can be obtained by accessing the help screens. This is a particularly useful way of determining what options are available and how one should go about invoking them.

**Regression**

The SAS procedure PROC REG is used to perform both simple and multiple regression. PROC REG assumes that all of the variables in the model are quantitative or continuous. If they are not, you might want to consider running PROC GLM (see above). In the following example, we are regressing total fertility rate (tfr) on the percent of the population considered literate and the percent of the population living in urban areas.

    LIBNAME place 'A:\';
    PROC REG data=place.nations;
      MODEL tfr=literacy urban;
    RUN;

PROC REG will produce a summary ANOVA table for the entire model along with estimates of the b-values, their standard errors and t-values. If you require standardized b-values, put a /stb at the end of the model statement (i.e., MODEL tfr=literacy urban /stb;). A /p option will result in the observed and predicted y-values being printed as part of the output. You might want to be cautious with this option if you have a large data set, since several hundred pages of output can be produced.

An output statement can also be used to save the regression results. Check the help menu using the index word reg for details.
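As a rough sketch of what such an output statement might look like, the example below saves the predicted values and residuals from the regression above. The data set name regout and the variable names yhat and resid are assumptions chosen for this illustration; the model itself is the one used in the text.

```sas
LIBNAME place 'A:\';

PROC REG data=place.nations;
  MODEL tfr=literacy urban / stb;
  /* Save each observation's predicted value and residual.             */
  /* The names regout, yhat and resid are assumptions for this sketch. */
  OUTPUT out=regout p=yhat r=resid;
RUN;
QUIT;

/* Inspect the first ten rows of the saved results. */
PROC PRINT data=regout (obs=10);
  VAR tfr literacy urban yhat resid;
RUN;
```

Saving the results to a data set avoids the volume of printed output that the /p option can generate on a large data set.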