Tutorial 3: Data Transformation and Selection
Manipulating the Data



Introduction
Data Transformations
Subsetting Data
Sorting Data


Introduction

In some situations, it is necessary to transform the data before we continue with our analysis. These transformations may take one of several forms. For example, it may be necessary to add several variables together to form an index or to take the square root or logarithm of a variable. All of these and similar transformations take place under the DATA statement.

In other situations, we might wish to take a subset of the data before conducting the analysis. Sometimes we deliberately limit our analysis only to one group (e.g., immigrants or women) or to observations that fall within a selected range. For example, in an analysis of elderly hospital patients we may wish to select only those cases where the respondent is over 65 years of age. Again the DATA statement allows these types of selections to be made.

Data Transformations

The data step not only allows you to read the data into SAS, it also allows you to perform transformations or to create new variables. For example, a student's grade for a course might be the arithmetic average between the student's scores on the midterm and the final examination. This can be achieve by using coding such as: Grade=(Midterm+Final)/2.

There are some restrictions, however, that are placed on the names of created variables. They can be from 1 to 8 characters in length and must begin with a letter from A-Z (upper or lower case) or with _ (an underscore).

SAS allows the use of most standard mathematical symbols. Specifically,

Symbol Operation
+     Addition
-     Subtraction
/     Division
*     Multiplication
**    Exponentiation (i.e., X squared is X**2)
The following is an example of how the DATA statement can be used to read an existing data set, make some alterations to the data and then create a new data set.

LIBNAME place 'D:\My SAS Files';
DATA newfile;
	SET place.mydata;
	loginc=log(income);
RUN;
PROC MEANS;
	variables income loginc;
RUN;
In this example, the existing data 'mydata' is read by the SET command from the directory defined in the LIBNAME statement. The final statement within the DATA step takes the natural logarithm of income and assigns it to the new variable loginc.

SAS runs through the data procedure and creates a temporary data file. If you want to keep that file called 'newfile'. If you want to make a permanent copy of newfile, you would put the libname prefix in front of it. In other words, replace the existing DATA statement with


DATA place.newfile;
As a final step, we use PROC MEANS to provide descriptive statistics for the variables income and loginc. Other operators such as SQRT, LOG10 and INT can be used if you need to take the square root, the logarithm (base 10) or to convert a real number to an integer. Arithmetic operations are conducted in standard order (i.e., multiplication and division precedes addition and subtraction). As with standard algebra, any ambiguity can be solved by using parentheses or "round brackets."

SAS also uses most logical operators. Here you can use either the operators symbol or its abbreviation.

Symbol Abbreviation

<, <=     LT, LE    Less than, and less than and equal to
>, >=     GT, GE    Greater than, and greater than and equal to
=, ^=     EQ, NE    Equal to, and not equal to

Subsetting Data

Another useful operator is the IF statement. IF statements can be use to assist in making selective transformations and in deleting certain cases to create subsets of data. The IF statement takes the form of
 
IF expression
	THEN statement;
	ELSE statement;
For example,
 
IF HOURSWK > 20 THEN STATUS= 'FULLTIME';
	ELSE STATUS= 'PARTTIME';
IF INCOME < 50000 THEN BONUS=0;
	ELSE BONUS=INCOME * .05;
If income < 25000 or income >= 75000 then delete;
Let's return to the example where we took the logarithm of income. Suppose a researcher wishes to study only women who earn between $5,000 and $100,000 per year. In this case, if statements would be used in combination with the log transformation statement.

LIBNAME place 'D:\My SAS Files';
DATA newfile;
	SET place.mydata;
	if sex='male' then delete;
	if income lt 5000 or income gt100000 then delete;
	loginc=log(income);
RUN;
PROC MEANS;
	variables income loginc;
RUN;
Again, the existing data 'mydata' is read by the SET command from the directory defined in the LIBNAME statement. After the SET command, there are two 'if' statements; The first if statement tells SAS to delete all cases where the variable sex is assigned the alphanumeric value 'male'. The next if statement tells SAS to delete those cases where income is either less than 5000 or greater than 100000. The final statement within the DATA step takes the natural logarithm of income and assigns it to the new variable loginc.

As previously, we also use PROC MEANS to provide our descriptive statistics.

Sorting Data
Often it is useful to sort the data before processing it. In some instances, sorting is also a necessary precondition for analysis. Many SAS procedures allow you to use a BY statement. BY statements will result in output by subgroup. For example, using the code

LIBNAME place 'D:\My SAS Files';
PROC MEANS data=place.mydata;
	by sex;
	variables income loginc;
RUN;
Will result in PROC MEANS generating descriptive statistics for women and for men. If the data are not presorted, however, an error message will be generated. Consequently, we would use PROC SORT prior to running PROC MEANS. For example

LIBNAME place 'D:\My SAS Files';
PROC SORT data=place.mydata;
	OUT=newfile;
	BY SEX;
RUN;
PROC MEANS;
	by sex;
	variables income loginc;
RUN;
Here, PROC SORT reads the permanent file 'mydata' sorts it by sex and puts the sorted results in the temporary file called 'newfile'. PROC MEANS then reads 'newfile' and conducts the appropriate analysis for each category listed in the variable sex. Once again, if we wanted to keep newfile permanently, we would put the libname reference in front of it (i.e., place.newfile)

Revised: September 25, 2000