**Tutorial
3: Data Transformation and Selection**

Manipulating the Data

Introduction

Data Transformations

Subsetting Data

Sorting Data

**Introduction**

In some situations, it is necessary to transform the data before we continue with our analysis. These transformations may take one of several forms. For example, it may be necessary to add several variables together to form an index or to take the square root or logarithm of a variable. All of these and similar transformations take place under theDATAstatement.

In other situations, we might wish to take a subset of the data before conducting the analysis. Sometimes we deliberately limit our analysis only to one group (e.g., immigrants or women) or to observations that fall within a selected range. For example, in an analysis of elderly hospital patients we may wish to select only those cases where the respondent is over 65 years of age. Again theDATAstatement allows these types of selections to be made.

The data step not only allows you to read the data into SAS, it also allows you to perform transformations or to create new variables. For example, a student's grade for a course might be the arithmetic average between the student's scores on the midterm and the final examination. This can be achieve by using coding such as: Grade=(Midterm+Final)/2.

There are some restrictions, however, that are placed on the names of created variables. They can be from 1 to 8 characters in length and must begin with a letter from A-Z (upper or lower case) or with _ (an underscore).

SAS allows the use of most standard mathematical symbols. Specifically,

The following is an example of how theSymbol Operation + Addition - Subtraction / Division * Multiplication ** Exponentiation (i.e., X squared is X**2)DATAstatement can be used to read an existing data set, make some alterations to the data and then create a new data set.In this example, the existing data 'mydata' is read by theLIBNAME place 'D:\My SAS Files'; DATA newfile; SET place.mydata; loginc=log(income); RUN; PROC MEANS; variables income loginc; RUN;SETcommand from the directory defined in the LIBNAME statement. The final statement within theDATAstep takes the natural logarithm of income and assigns it to the new variable loginc.SAS runs through the data procedure and creates a temporary data file. If you want to keep that file called 'newfile'. If you want to make a permanent copy of newfile, you would put the libname prefix in front of it. In other words, replace the existing

DATAstatement withAs a final step, we useDATA place.newfile;PROC MEANSto provide descriptive statistics for the variables income and loginc. Other operators such as SQRT, LOG10 and INT can be used if you need to take the square root, the logarithm (base 10) or to convert a real number to an integer. Arithmetic operations are conducted in standard order (i.e., multiplication and division precedes addition and subtraction). As with standard algebra, any ambiguity can be solved by using parentheses or "round brackets."SAS also uses most logical operators. Here you can use either the operators symbol or its abbreviation.

Symbol Abbreviation

<, <= LT, LE Less than, and less than and equal to >, >= GT, GE Greater than, and greater than and equal to =, ^= EQ, NE Equal to, and not equal to

Another useful operator is the IF statement. IF statements can be use to assist in making selective transformations and in deleting certain cases to create subsets of data. The IF statement takes the form of

For example,IF expression THEN statement; ELSE statement;

Let's return to the example where we took the logarithm of income. Suppose a researcher wishes to study only women who earn between $5,000 and $100,000 per year. In this case, if statements would be used in combination with the log transformation statement.IF HOURSWK > 20 THEN STATUS= 'FULLTIME'; ELSE STATUS= 'PARTTIME'; IF INCOME < 50000 THEN BONUS=0; ELSE BONUS=INCOME * .05; If income < 25000 or income >= 75000 then delete;Again, the existing data 'mydata' is read by theLIBNAME place 'D:\My SAS Files'; DATA newfile; SET place.mydata; if sex='male' then delete; if income lt 5000 or income gt100000 then delete; loginc=log(income); RUN; PROC MEANS; variables income loginc; RUN;SETcommand from the directory defined in theLIBNAMEstatement. After theSETcommand, there are two 'if' statements; The first if statement tells SAS to delete all cases where the variable sex is assigned the alphanumeric value 'male'. The next if statement tells SAS to delete those cases where income is either less than 5000 or greater than 100000. The final statement within theDATAstep takes the natural logarithm of income and assigns it to the new variable loginc.As previously, we also use

PROC MEANSto provide our descriptive statistics.

Often it is useful to sort the data before processing it. In some instances, sorting is also a necessary precondition for analysis. Many SAS procedures allow you to use aBYstatement.BYstatements will result in output by subgroup. For example, using the codeWill result inLIBNAME place 'D:\My SAS Files'; PROC MEANS data=place.mydata; by sex; variables income loginc; RUN;PROC MEANSgenerating descriptive statistics for women and for men. If the data are not presorted, however, an error message will be generated. Consequently, we would usePROC SORTprior to runningPROC MEANS. For exampleHere,LIBNAME place 'D:\My SAS Files'; PROC SORT data=place.mydata; OUT=newfile; BY SEX; RUN; PROC MEANS; by sex; variables income loginc; RUN;PROC SORTreads the permanent file 'mydata' sorts it by sex and puts the sorted results in the temporary file called 'newfile'.PROC MEANSthen reads 'newfile' and conducts the appropriate analysis for each category listed in the variable sex. Once again, if we wanted to keep newfile permanently, we would put the libname reference in front of it (i.e., place.newfile)