This post is related to SAS programmming which describes the basic tools and techniques that is required for performing some required elementary data analysis.
A SAS program consists of a sequence of steps. Each step in the program performs a specific task. There are two kinds of steps in SAS programs: data steps and proc or procedure steps. A SAS program can contain any combination of data steps and proc steps depending on the tasks you want to perform. A data step generally reads data from an input source, processes it, and creates a SAS table.
A data step might also filter rows, compute new columns, join tables, and perform other data manipulations. In this program, the data step is creating an output table and adding a new column.
data mydata; set sashelp.cars; run;
A proc or procedure step processes a SAS table in a specific a predefined way. SAS has dozens of procedures that generate reports and graphs, managed data, or perform complex statistical analyses.
proc print data=mydata; run;
Most steps end with a run statement, and a few proc steps and with a quit statement. If you don’t use a run statement at the end of a step, the beginning of a new data or proc step signals the end of the previous step
In addition to data and proc steps, a SAS program can also contain global statements. These statements can be outside data and proc steps, and they typically defined some option or setting for the SAS session. Global statements do not need a run statement after them.
Accessing the summary of SAS table
Suppose we have a SAS table sample.sas7bdat located in ~/project/data/. We can then generate the summary/report with the below chunk of code:
proc contents data="~/project/data/sample.sas7bdat"; run;
Often times, it is difficult to read each and every file through the hardcoded path as defined above. Use of SAS libraries helps in overcoming this situation.
We use SAS libraries to specify the two required pieces of information – the location and file type. For example, we can create the SAS library mylib with base as the engine to be enabled to read the data. The engine is the name of a behind-the-scenes set of instructions for reading structured data. There is a different engine for each type of data SAS can read, including base SAS tables, Excel, Teradata, Hadoop, and many others. Engines enable SAS to read these structured data files directly, without having to do a separate, manual import into SAS.
So, we first create the library mylib with base engine as shown below.
libname mylib base "~/project/data"; run;
We then use the library to access the SAS table sample.sas7bdat.
proc contents data=mylib.sample; run;
We can delete the library reference which means we are just deleting the pointer referencing to the library.
libname mylib clear;
There are some automatic SAS libraries – work and sashelp. The work library is a sort of temporary library which cease to exist when the SAS session restarts or ends. The sashelp library contains the collection of sample tables and other files.
Importing unstructured data
Importing the csv file:
proc import datafile="~/project/data/sample.csv" dbms=csv out=sample_import replace; run; proc contents data=sample_import; run;
Importing the xlsx file:
proc import datafile="~/project/data/sample.xlsx" dbms=xlsx out=sample_import replace; sheet=sheet0; run; proc contents data=sample_import; run;
- print – The print procedure creates a listing of all rows and columns in the data.
/* listing first 10 rows of the table */ proc print data=mylib.sample(obs=10); var col1 col2 col3 col4; run;
- means – The means procedure calculate simple summary statistics for numeric columns in the data.
/* Computing summary statistics of specific columns */ proc means data=mylib.sample; var col1 col2; run;
- univariate – The Univariate Procedure also generate summary statistics, but it includes more detailed statistics related to distribution.
/* Examine extreme values. Also can be used to compute summary statistics */ proc univariate data=mylib.sample; var col1 col2; run;
- freq – The freq Procedure creates a frequency table for each column in the input table.
/* Computing summary statistics of specific columns */ proc freq data=mylib.sample; tables col1 col2; run;
There are many other procedures as well that provides usefulness in extracting useful information about the data.
Filtering rows with WHERE statement
We can filter or subset the rows of the data by includint the where statement along with relevant procedures.
/* Printing the table with values only when col2 >= 100 */ proc print data=mylib.sample; where col2>=100; run;
/* Printing the table with values only when coldate >= 01jan2010 */ proc print data=mylib.sample; where date_col>="01jan2010"d; run;
/* Printing the table with values only when col1 >= 100 and col2 contains string "example" */ proc print data=mylib.sample; where col1>=100 and col2="example"; run;
Using Special WHERE characters
Some of the special characters that can be used in conjunction with where statements to filter the data are:
The SAS macro language is designed to help make your programs reusable and dynamic. Macro variable is one of the elements of the macro language. All statements in the macro language begin with a % sign, and the %LET statement creates a macro variable as indicated below:
/* Creating a macro variable */ %LET our_value=200; %LET our_char=Tom; %LET our_date=01Jan2010;
We can use the macro variable as shown below:
proc print data=mylib.sample; where col1>=&our_value and col2="&our_char" and date_col>="&our_date"d; run;
Formatting data values
We can format the data values in results using
format. In the example below, if we have a sample data as shown in on the left side of the below below:
We can then format the Height and Weight column in 3. format and Birthdate with the date9. format as shown on the right side of the figure above in the following way:
proc print data=mylib.sample; format Height Weight 3. Birthdate date9.; run;
There are number of numeric value and date value formats that one can find useful for their own purpose.
To sort the rows of the table, we use sort procedure as shown below:
proc sort data=input-table out=output-table; by descending colname(s); run;
Identifying and removing duplicates
We can identify and remove duplicate rows in a table in the following way:
proc sort data=input-table out=output-table noduprecs dupout=dup-output-table; by __all__; run;
proc sort data=class_test3 out=test_clean noduprecs dupout=test_dups; by __all__; run;
There is also a way in which we can identify whether you have duplicated values for particular columns. This can be achieved as below:
proc sort data=input-table out=output-table nodupkey dupout=dup-output-table; by descending colname(s); run;
proc sort data=class_test2 out=test_clean nodupkey dupout=test_dups; by Name; run;
Creating SAS dataset
The SAS data step reads an existing table and writes a new one. We can write this in form of data step as shown below:
data output-table; set input-table; run;
We can export the data using the export procedure step of SAS as shown below:
proc export data=input-table outfile="output-file" <DBMS=identifier> replace; run;
Using the SAS Output Delivery System (ODS)
Exporting a report to a .csv file can be done as below:
ods <destination><destination-specifications>; /*SAS code that produces output*/ run;
ods csvall file="data/sample.csv"; proc print data=sashelp.cars; var col1 col2; run; ods csvall close;
In a similar way, exporting a report to a excel file can be done as below:
ods excel file="data/sample.xlsx" style=<style> options(sheetname='sheet0'); /*SAS code that produces output*/ run; ods excel close;