Introduction
An Example Data Set
The DATA Step
PROC PRINT
PROC CONTENTS
The Whole SAS Program
External Data Sets
Reading Text (ASCII) Data Sets
Saving a SAS System File
Reading a SAS System File
Potential Problems
SAS (Statistical Analysis System) is a flexible program for conducting data analysis. SAS contains many built-in procedures for descriptive, analytic and exploratory analyses. It allows users to conduct a wide range of statistical analyses, including analysis of variance, regression, categorical data analysis, multivariate analysis, survival analysis, cluster analysis, and nonparametric analysis.
SAS has a powerful macro language that allows for the implementation of new techniques and variations on old ones. SAS is also used in both corporate and scientific environments as a database management program for handling large and complex data structures.
While SAS is extremely powerful, it is possible to get "up and running" with a few simple commands. This tutorial begins by introducing the basics of data entry and display. Throughout this tutorial, an example dataset called nations will be used. The dataset contains a random sample of 37 countries and 15 variables from the early 1980s. The data are listed in Table 1 below.
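Table 1: 37 Country Sample Data from the 1980s

| country | pop83 | pcturban | birthrte | life_exp | gnp82 | religion | urban | growth | life_fem | tfr | literacy | inflatn | gdp | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austria | 7.5 | 55 | 12 | 73 | 9880 | C | U | 0.3 | 80 | 1.5 | 98 | 2.7 | Dev | Eu |
| Belgium | 10.2 | . | 12 | 73 | 10899 | C | U | 0.1 | 80 | 1.6 | 98 | 3.6 | Dev | Eu |
| Denmark | 5.1 | 83 | 10 | 74 | 12470 | C | U | -1 | 79 | 1.6 | 99 | 4.25 | Dev | Eu |
| Finland | 4.8 | 60 | 14 | 74 | 10870 | C | U | 0.3 | 80 | 1.7 | 100 | 6.5 | Dev | Eu |
| France | 54.4 | 73 | 15 | 75 | 11680 | C | U | 0.4 | 82 | 1.8 | 99 | 3.5 | Dev | Eu |
| Greece | 9.9 | 65 | 14 | 74 | 4290 | C | U | 0.2 | 80 | 1.5 | 95 | 14.8 | Dev | Eu |
| Switzerland | 6.5 | 58 | 12 | 76 | 17010 | C | U | 0.6 | 83 | 1.6 | 99 | 2.8 | Dev | Eu |
| Spain | 38.2 | 91 | 13 | 74 | 5430 | C | U | 0.3 | 82 | 1.4 | 97 | 7 | Dev | Eu |
| UK | 56.3 | 76 | 13 | 73 | 9660 | C | U | 0.3 | 79 | 1.8 | 99 | 7.8 | Dev | Eu |
| Italy | 57.5 | 69 | 11 | 74 | 6840 | C | U | 0.2 | 81 | 1.4 | 93 | 6.6 | Dev | Eu |
| Sweden | 8.3 | 83 | 11 | 76 | 14040 | C | U | 0.5 | 81 | 1.9 | 99 | 5.7 | Dev | Eu |
| Portugal | 10.1 | 30 | 16 | 71 | 2450 | C | R | 0.3 | 78 | 1.5 | 83 | 11.8 | Dev | Eu |
| Netherlands | 14.4 | 88 | 12 | 76 | 10930 | C | U | 0.6 | 81 | 1.6 | 99 | 1.5 | Dev | Eu |
| Norway | 4.1 | 70 | 12 | 76 | 14280 | C | U | 0.5 | 81 | 1.8 | 100 | 4.5 | Dev | Eu |
| Poland | 36.6 | 59 | 19 | 71 | . | C | U | -1 | 77 | 2.1 | 98 | 640 | Dev | Eu |
| Hungary | 10.7 | 54 | 12 | 70 | 2270 | C | U | -0.1 | 75 | 1.8 | 99 | 18 | Dev | Eu |
| Czechoslov | 15.4 | 67 | 15 | 71 | . | C | U | 0.3 | 76 | 2 | 99 | 1.5 | Dev | Eu |
| Gambia | 0.7 | 18 | 49 | 35 | 360 | I | R | 3.1 | 50 | 6.5 | 25.1 | 8 | Emg | Af |
| Iraq | 14.5 | 68 | 47 | 59 | . | I | U | 3.9 | 68 | 7.3 | 55 | 30 | Emg | Ot |
| Pakistan | 94.7 | 28 | 43 | 50 | 380 | I | R | 2.2 | 57 | 6.7 | 26 | 11 | Emg | Ot |
| Bangladesh | 95.9 | 11 | 49 | 48 | 140 | I | R | 2.8 | 53 | 5.7 | 29 | 8 | Emg | Ot |
| Ethiopia | 33.8 | 14 | 47 | 43 | 140 | I | R | 3.5 | 52 | 7 | 55.2 | 9.6 | Emg | Af |
| Guinea | 5.4 | 19 | 47 | 40 | 310 | I | R | 2.6 | 44 | 6.1 | 20 | 27 | Emg | Af |
| Malaysia | 15.1 | 30 | 31 | 67 | 1860 | I | R | 2.3 | 71 | 3.5 | 65 | 3.6 | Dev | Ot |
| Senegal | 6.1 | 34 | 48 | 43 | 490 | I | R | 3 | 56 | 6.3 | 28.1 | 1.8 | Emg | Af |
| Mali | 7.5 | 17 | 46 | 42 | 180 | I | R | 2.3 | 47 | 7.1 | 18 | . | Emg | Af |
| Libya | 3.3 | 52 | 46 | 58 | 8510 | I | U | 3.1 | 70 | 5.2 | 50 | 20 | Dev | Af |
| Somalia | 5.3 | 30 | 47 | 43 | 290 | I | R | 0.8 | 54 | 7.3 | 11.6 | 81.7 | Emg | Af |
| Afghanistan | 17.2 | 16 | 48 | 37 | . | I | R | 7.7 | 46 | 6.4 | 12 | 50 | Emg | Ot |
| Sudan | 20 | 21 | 47 | 48 | 440 | I | R | 2.9 | 55 | 6.5 | 31 | 70 | Emg | Af |
| Turkey | 48.7 | 45 | 31 | 63 | 1370 | I | U | 2.2 | 67 | 3.6 | 70 | 68.8 | Emg | Ot |
| Algeria | 21 | 52 | 44 | 60 | 2350 | I | U | 2.8 | 64 | 5.4 | 52 | 5.9 | Dev | Af |
| Yemen | 6.2 | 12 | 48 | 44 | 500 | I | R | 3.1 | 49 | 7.6 | 15 | 16.9 | Emg | Af |
| Argentina | 28 | 82 | 24 | 70 | 2520 | C | U | 1.2 | 74 | 2.8 | 94 | 4925 | Dev | Am |
| Barbados | 0.3 | 41 | 17 | 71 | 2900 | C | U | 0.6 | 77 | 2.1 | 99 | 4.7 | Dev | Am |
| Bolivia | 6 | 45 | 42 | 51 | 570 | C | U | 2.1 | 56 | 4.7 | 63 | 15.5 | Emg | Am |
| Brazil | 131 | 68 | 31 | 63 | 2240 | C | U | 1.9 | 68 | 3.1 | 76 | 1765 | Dev | Am |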
Notice that the data are arranged in a rectangular array in which the rows represent observations (also called cases) and the columns represent variables. In this instance, the observations are countries. Although we have tried to use meaningful variable names, datasets can be confusing without more information about the data. Table 2 below contains the codebook for the nations dataset.
Table 2: Codebook for the nations Dataset

| Variable Number | Variable Name | Variable Label | Coded Response |
|---|---|---|---|
| 1 | country | Country | alphanumeric |
| 2 | pop83 | Population in 1983 (in millions) | numeric |
| 3 | pcturban | Percent of population living in urban areas | numeric |
| 4 | birthrte | Births per 1,000 population | numeric |
| 5 | life_exp | Life expectancy at birth | numeric |
| 6 | gnp82 | GNP per person in $US | numeric |
| 7 | religion | Predominant religion | C = Christian; I = Islamic |
| 8 | urban | Urban if variable 3 is over 50%; rural otherwise | R = rural; U = urban |
| 9 | growth | Annual rate of growth (%) in economy | numeric |
| 10 | life_fem | Female life expectancy at birth | numeric |
| 11 | tfr | Total Fertility Rate | numeric |
| 12 | literacy | Percentage of population literate | numeric |
| 13 | inflatn | Rate of inflation (%) | numeric |
| 14 | gdp | Level of economic development | Dev = developed; Emg = emerging |
| 15 | area | Geographical region | Eu = Europe; Af = Africa; Am = America; Ot = Other |
You can start your adventure in SAS programming by creating a SAS dataset using the 37 observations displayed above. A DATA statement will begin a DATA step that will create a temporary SAS dataset called work.nations. SAS procedures can only work on SAS datasets. Later on in the tutorial, we will show you how to create permanent data sets. The difference between a temporary and a permanent data set is that temporary data sets are kept in a scratch directory and deleted once the program is closed. Permanent data sets are saved for use in a later session.
In this example, the data are going to be read instream, that is, the data are entered into the Editor window and become part of the SAS program. The data are placed right after the DATALINES statement. Later in the tutorial we will show you how to read data from a separate file.
Our first SAS program can be created by combining the data from Table 1 and the variable names presented in Table 2. Together, this information looks as follows.
OPTIONS ls=72;
DATA nations;
INPUT country $ pop83 pcturban birthrte life_exp gnp82 religion $ urban $ growth life_fem tfr literacy inflatn gdp $ area $;
DATALINES;
Austria 7.5 55 12 73 9880 C U 0.3 80 1.5 98 2.7 Dev Eu
Belgium 10.2 . 12 73 10899 C U 0.1 80 1.6 98 3.6 Dev Eu
Denmark 5.1 83 10 74 12470 C U -1 79 1.6 99 4.25 Dev Eu
Finland 4.8 60 14 74 10870 C U 0.3 80 1.7 100 6.5 Dev Eu
France 54.4 73 15 75 11680 C U 0.4 82 1.8 99 3.5 Dev Eu
Greece 9.9 65 14 74 4290 C U 0.2 80 1.5 95 14.8 Dev Eu
Switzerland 6.5 58 12 76 17010 C U 0.6 83 1.6 99 2.8 Dev Eu
Spain 38.2 91 13 74 5430 C U 0.3 82 1.4 97 7 Dev Eu
UK 56.3 76 13 73 9660 C U 0.3 79 1.8 99 7.8 Dev Eu
Italy 57.5 69 11 74 6840 C U 0.2 81 1.4 93 6.6 Dev Eu
Sweden 8.3 83 11 76 14040 C U 0.5 81 1.9 99 5.7 Dev Eu
Portugal 10.1 30 16 71 2450 C R 0.3 78 1.5 83 11.8 Dev Eu
Netherlands 14.4 88 12 76 10930 C U 0.6 81 1.6 99 1.5 Dev Eu
Norway 4.1 70 12 76 14280 C U 0.5 81 1.8 100 4.5 Dev Eu
Poland 36.6 59 19 71 . C U -1 77 2.1 98 640 Dev Eu
Hungary 10.7 54 12 70 2270 C U -0.1 75 1.8 99 18 Dev Eu
Czechoslov 15.4 67 15 71 . C U 0.3 76 2 99 1.5 Dev Eu
Gambia 0.7 18 49 35 360 I R 3.1 50 6.5 25.1 8 Emg Af
Iraq 14.5 68 47 59 . I U 3.9 68 7.3 55 30 Emg Ot
Pakistan 94.7 28 43 50 380 I R 2.2 57 6.7 26 11 Emg Ot
Bangladesh 95.9 11 49 48 140 I R 2.8 53 5.7 29 8 Emg Ot
Ethiopia 33.8 14 47 43 140 I R 3.5 52 7 55.2 9.6 Emg Af
Guinea 5.4 19 47 40 310 I R 2.6 44 6.1 20 27 Emg Af
Malaysia 15.1 30 31 67 1860 I R 2.3 71 3.5 65 3.6 Dev Ot
Senegal 6.1 34 48 43 490 I R 3 56 6.3 28.1 1.8 Emg Af
Mali 7.5 17 46 42 180 I R 2.3 47 7.1 18 . Emg Af
Libya 3.3 52 46 58 8510 I U 3.1 70 5.2 50 20 Dev Af
Somalia 5.3 30 47 43 290 I R 0.8 54 7.3 11.6 81.7 Emg Af
Afghanistan 17.2 16 48 37 . I R 7.7 46 6.4 12 50 Emg Ot
Sudan 20 21 47 48 440 I R 2.9 55 6.5 31 70 Emg Af
Turkey 48.7 45 31 63 1370 I U 2.2 67 3.6 70 68.8 Emg Ot
Algeria 21 52 44 60 2350 I U 2.8 64 5.4 52 5.9 Dev Af
Yemen 6.2 12 48 44 500 I R 3.1 49 7.6 15 16.9 Emg Af
Argentina 28 82 24 70 2520 C U 1.2 74 2.8 94 4925 Dev Am
Barbados 0.3 41 17 71 2900 C U 0.6 77 2.1 99 4.7 Dev Am
Bolivia 6 45 42 51 570 C U 2.1 56 4.7 63 15.5 Emg Am
Brazil 131 68 31 63 2240 C U 1.9 68 3.1 76 1765 Dev Am
;
RUN;
If you want, you can open SAS and copy and paste the material from the OPTIONS statement through the RUN statement into the Editor window. Both SAS and your browser should support the standard Windows commands for selecting, copying and pasting the excerpt.
Let's now look at the statements individually. The OPTIONS statement tells SAS to limit the output to a linesize of 72 characters. Typically, SAS will generate output or results in a window that is 80 characters in width. The DATA statement tells SAS that it should be prepared to enter or manipulate some data. The word nations on the DATA statement provides the data set with a name or handle to which we can refer in later analyses. The third line, or INPUT statement, is where we list the names of the variables. Notice that the variables that contain alphanumerics (characters) have a '$' sign following them. If the $ sign is omitted, SAS tries to read the values as numbers, which produces invalid-data messages in the Log window.
The next statement, DATALINES, tells SAS that the data are to follow immediately. In some documentation, you will find that the DATALINES statement is replaced by a CARDS statement. That is a throwback to the old days when data were entered into mainframe computers on punch cards.
Notice that the end of the data is signified by placing a semicolon on a line by itself. The RUN; statement tells SAS that we are at the end of the DATA step. This DATA step does nothing but enter the data. An indication that the data have been read should appear in the Log window. To ensure that the data have been read in properly, it is common to print out some or all of the data. This can be achieved by using PROC PRINT.
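One detail about the INPUT statement is worth a brief sketch. With the simple list input used above, SAS gives character variables a default length of 8, so a name such as Switzerland would be stored as Switzerl. One common way around this is to declare the length with a LENGTH statement before the INPUT statement. In the sketch below, the data set name nations_len and the length of 12 are illustrative assumptions, and only three of the 37 rows are included.

```sas
/* Sketch only: nations_len and the length of 12 are illustrative choices. */
DATA nations_len;
  LENGTH country $ 12;   /* declare the length before INPUT reads the variable */
  INPUT country $ pop83 pcturban birthrte life_exp gnp82 religion $ urban $
        growth life_fem tfr literacy inflatn gdp $ area $;
DATALINES;
Switzerland 6.5 58 12 76 17010 C U 0.6 83 1.6 99 2.8 Dev Eu
Netherlands 14.4 88 12 76 10930 C U 0.6 81 1.6 99 1.5 Dev Eu
Afghanistan 17.2 16 48 37 . I R 7.7 46 6.4 12 50 Emg Ot
;
RUN;

/* Print the result to confirm that the country names are stored in full. */
PROC PRINT DATA=nations_len;
RUN;
```

Because the LENGTH statement comes first, country is still the first variable in the data set.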
PROC PRINT is used to display the contents of a SAS dataset. PROC PRINT is often used to assure us that the data were read into SAS correctly.
In the statement below, the DATA= option indicates which dataset is to be used. If DATA=nations were left off, SAS would use the most recently created data set. Because it is possible to use more than one dataset at a time, it is good practice to name the dataset to be used so that there is no confusion.
PROC PRINT DATA=nations; RUN;
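When only some of the data need to be checked, the output can be narrowed. The following is a minimal sketch, assuming the nations data set created above: the OBS= data set option limits the number of observations and a VAR statement selects the variables. The choice of five observations and these three variables is purely illustrative.

```sas
/* Sketch: print the first 5 observations and three selected variables. */
PROC PRINT DATA=nations (OBS=5);
  VAR country pop83 gnp82;   /* only these variables are printed */
RUN;
```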
PROC CONTENTS displays information about a SAS dataset.
PROC CONTENTS DATA=nations; RUN;
The most important things to note right now are that there are 37 observations on 15 variables and that the variables are listed in alphabetical order.
Let's try the PROC CONTENTS again, this time using the POSITION option to display the variables in the order in which they were entered.
PROC CONTENTS DATA=nations POSITION; RUN;
It is possible to put all of the SAS statements that were used above in a single SAS program. Review the whole program and see if you can identify the following parts: the DATA step, the data, and the PROC steps.
OPTIONS ls=72;
DATA nations;
INPUT country $ pop83 pcturban birthrte life_exp gnp82 religion $ urban $ growth life_fem tfr literacy inflatn gdp $ area $;
DATALINES;
Austria 7.5 55 12 73 9880 C U 0.3 80 1.5 98 2.7 Dev Eu
Belgium 10.2 . 12 73 10899 C U 0.1 80 1.6 98 3.6 Dev Eu
Denmark 5.1 83 10 74 12470 C U -1 79 1.6 99 4.25 Dev Eu
Finland 4.8 60 14 74 10870 C U 0.3 80 1.7 100 6.5 Dev Eu
France 54.4 73 15 75 11680 C U 0.4 82 1.8 99 3.5 Dev Eu
Greece 9.9 65 14 74 4290 C U 0.2 80 1.5 95 14.8 Dev Eu
Switzerland 6.5 58 12 76 17010 C U 0.6 83 1.6 99 2.8 Dev Eu
Spain 38.2 91 13 74 5430 C U 0.3 82 1.4 97 7 Dev Eu
UK 56.3 76 13 73 9660 C U 0.3 79 1.8 99 7.8 Dev Eu
Italy 57.5 69 11 74 6840 C U 0.2 81 1.4 93 6.6 Dev Eu
Sweden 8.3 83 11 76 14040 C U 0.5 81 1.9 99 5.7 Dev Eu
Portugal 10.1 30 16 71 2450 C R 0.3 78 1.5 83 11.8 Dev Eu
Netherlands 14.4 88 12 76 10930 C U 0.6 81 1.6 99 1.5 Dev Eu
Norway 4.1 70 12 76 14280 C U 0.5 81 1.8 100 4.5 Dev Eu
Poland 36.6 59 19 71 . C U -1 77 2.1 98 640 Dev Eu
Hungary 10.7 54 12 70 2270 C U -0.1 75 1.8 99 18 Dev Eu
Czechoslov 15.4 67 15 71 . C U 0.3 76 2 99 1.5 Dev Eu
Gambia 0.7 18 49 35 360 I R 3.1 50 6.5 25.1 8 Emg Af
Iraq 14.5 68 47 59 . I U 3.9 68 7.3 55 30 Emg Ot
Pakistan 94.7 28 43 50 380 I R 2.2 57 6.7 26 11 Emg Ot
Bangladesh 95.9 11 49 48 140 I R 2.8 53 5.7 29 8 Emg Ot
Ethiopia 33.8 14 47 43 140 I R 3.5 52 7 55.2 9.6 Emg Af
Guinea 5.4 19 47 40 310 I R 2.6 44 6.1 20 27 Emg Af
Malaysia 15.1 30 31 67 1860 I R 2.3 71 3.5 65 3.6 Dev Ot
Senegal 6.1 34 48 43 490 I R 3 56 6.3 28.1 1.8 Emg Af
Mali 7.5 17 46 42 180 I R 2.3 47 7.1 18 . Emg Af
Libya 3.3 52 46 58 8510 I U 3.1 70 5.2 50 20 Dev Af
Somalia 5.3 30 47 43 290 I R 0.8 54 7.3 11.6 81.7 Emg Af
Afghanistan 17.2 16 48 37 . I R 7.7 46 6.4 12 50 Emg Ot
Sudan 20 21 47 48 440 I R 2.9 55 6.5 31 70 Emg Af
Turkey 48.7 45 31 63 1370 I U 2.2 67 3.6 70 68.8 Emg Ot
Algeria 21 52 44 60 2350 I U 2.8 64 5.4 52 5.9 Dev Af
Yemen 6.2 12 48 44 500 I R 3.1 49 7.6 15 16.9 Emg Af
Argentina 28 82 24 70 2520 C U 1.2 74 2.8 94 4925 Dev Am
Barbados 0.3 41 17 71 2900 C U 0.6 77 2.1 99 4.7 Dev Am
Bolivia 6 45 42 51 570 C U 2.1 56 4.7 63 15.5 Emg Am
Brazil 131 68 31 63 2240 C U 1.9 68 3.1 76 1765 Dev Am
;
RUN;
PROC PRINT DATA=nations;
RUN;
PROC CONTENTS DATA=nations;
RUN;
PROC CONTENTS DATA=nations POSITION;
RUN;
Large amounts of data can become difficult to include in the DATA section of your program. Fortunately, the data can be read directly from an external data file. SAS has the ability to read files in several formats, including Excel, dBase and other statistical formats. Often, however, data are provided to us in text or ASCII format. SAS can easily input those data directly if the entries in the data set are tab, space or comma delimited. That is, each data element must be separated from the next by a tab, space or comma. It is also possible for SAS to read data in which the values run into each other, provided that each variable occupies the same columns in every observation.
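The fixed-column case can be sketched with column input, in which the INPUT statement gives the start and end columns of each variable. The example below is only an illustration: the data set name fixedcol, the column positions and the zero-padded layout of the three rows are assumptions made for this sketch, not the layout of any file supplied with this tutorial.

```sas
/* Sketch: column input for values that run together with no delimiters.   */
/* Assumed layout: country in columns 1-11, pop83 in 12-16, pcturban 17-18 */
DATA fixedcol;
  INPUT country $ 1-11 pop83 12-16 pcturban 17-18;
DATALINES;
Switzerland006.558
Netherlands014.488
Austria    007.555
;
RUN;

PROC PRINT DATA=fixedcol;
RUN;
```

Because the column range sets the width of country, the longer names are read in full here.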
SAS also has the ability to save and read its own system or binary files. The advantage of creating system files is that they can be processed much more quickly. This may not be an issue when the data contain only a few observations but it may become one when the data contain several thousand records.
Reading Text (ASCII) Data Sets
In the following example, it is assumed that the data comprise a comma, space or tab delimited text file. For instructions on importing other file formats, see the on-line SAS documentation under "Help".
To access an external data file in text format, a few changes need to be made to the DATA step. Specifically,
1) the DATALINES or CARDS statement is deleted, and
2) an INFILE statement is added.
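The INFILE statement tells SAS where to find the dataset. The path to the dataset is enclosed in quotes on the INFILE statement. By default, SAS expects the values to be separated by blanks; other delimiters are specified with the DLM= option on the INFILE statement.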
DATA nations;
INFILE 'A:\sas\nations.txt';
INPUT country $ pop83 pcturban birthrte life_exp gnp82 religion $ urban $
growth life_fem tfr literacy inflatn gdp $ area $;
RUN;
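For files that use commas or tabs rather than blanks, the DLM= option on the INFILE statement names the delimiter. The following is a minimal sketch. The file names nations.csv and nations_tab.txt and the data set names are hypothetical and stand in for whatever files you actually have.

```sas
/* Sketch: comma-delimited file (hypothetical file name nations.csv).     */
/* Adding the DSD option as well would treat two consecutive commas as a  */
/* missing value.                                                         */
DATA nations_csv;
  INFILE 'A:\sas\nations.csv' DLM=',';
  INPUT country $ pop83 pcturban birthrte life_exp gnp82 religion $ urban $
        growth life_fem tfr literacy inflatn gdp $ area $;
RUN;

/* Sketch: tab-delimited file (hypothetical file name nations_tab.txt).   */
/* '09'x is the hexadecimal code for the tab character.                   */
DATA nations_tab;
  INFILE 'A:\sas\nations_tab.txt' DLM='09'x;
  INPUT country $ pop83 pcturban birthrte life_exp gnp82 religion $ urban $
        growth life_fem tfr literacy inflatn gdp $ area $;
RUN;
```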
Often it is useful to save a SAS system file for later use. We do this by first creating a folder or directory in which the file is to be placed. For example, we might create the folder c:\myfiles. The SAS system refers to these folders as libraries. It is possible to create several libraries, each containing numerous files. This is a convenient way of organizing files that have common themes. Once the directory is created, it is referenced through a LIBNAME statement.
LIBNAME storage 'c:\myfiles';
DATA storage.nations;
INFILE 'A:\sas\nations.txt';
INPUT country $ pop83 pcturban birthrte life_exp gnp82 religion $ urban $
growth life_fem tfr literacy inflatn gdp $ area $;
RUN;
Here, the LIBNAME statement assigns the directory or library location the name 'storage'. This is particularly handy when we are dealing with a complex subdirectory structure; it saves us the effort of writing out the location in full. Later in the program, files can be associated with the library by placing a period between the libname and the file name. When a file name is associated with a libname, SAS assumes that the file is to be a permanent one and saves it in that library. Without a libname, files are treated as temporary entities and deleted once we exit the program. By putting 'storage.nations' on the DATA statement, the data will be saved as a system file in the directory c:\myfiles.
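To confirm that the file was actually saved, one option is to list the contents of the library. The sketch below assumes the storage library defined above; _ALL_ refers to every data set in the library, and NODS suppresses the detailed listing for each one so that only the directory of members is shown.

```sas
/* Sketch: list the SAS data sets stored in the library c:\myfiles. */
LIBNAME storage 'c:\myfiles';

PROC CONTENTS DATA=storage._ALL_ NODS;
RUN;
```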
Once we have created and saved a SAS system file in a folder, we can access it by using the LIBNAME structure. Again, assume that we have created the folder c:\myfiles and stored the SAS data set in it. The system file can then be accessed by using the SET statement within the DATA step.
LIBNAME storage 'c:\myfiles';
DATA newnats;
SET storage.nations;
RUN;
PROC PRINT DATA=newnats (OBS=10);
RUN;
Here, the SET statement identifies the system file we want to read, while the term 'newnats' on the DATA statement creates a temporary file called newnats (naturally). The OBS=10 data set option on the PROC PRINT statement limits the printout to the first 10 observations.
Problems to look out for...
Revised: August 25, 2000