Pages

Tuesday, January 22, 2019

SAS Example : Split dataset into n equally sized datasets


Considering the cars dataset available in sashelp and splitting it to 10 equally sized datasets.


Step1:

Creating a row number using _n_ system variable option.
           
               data sample;
                   set sashelp.cars;
                        n=_n_;
               run;


Step2:

The below macro splits the data into n equally sized datasets. n is passed as a parameter to the macro.
  • Macro name : split
  • creating a cnt macro variable which holds the no. of observation count of the cars dataset. Cars dataset has 428 observations, therefore cnt = 428 
  • number macro variable holds the no. of observations to be kept in each split dataset. In the below macro, cars dataset is split into 10 datasets. therefore number = 428/10 ~ 43 observations 
  • DO loop to iterate 10 times to split the cars dataset into 10 equally sized datasets. The split datasets are named as sample_1 to sample_10.

%macro split(divide);

         proc sql;
               select max(n) into :cnt from sample;
         quit;

         data split;
               num=ceil(&cnt/&divide);
         run;

         proc sql;
               select num into :number from split;
         quit;

         %do i=1 %to ÷

                data sample_&i;
                     set sample;
                          if n>=((&i-1)*&number)+1 and n<=(&i*&number);
                run;

        %end;

%mend split;

%split(10);

Code:



Output:



Thursday, January 17, 2019

Statistics : Basics - Mean , Median and Mode

Arithmetic mean/Mean


It is the most common measure of central tendency. It serves as a balance point in a set of data. It is the only common measure in which all the values play an equal role. Mean is computed as sum of all the values divided by number of values in the data set. It is denoted by X-bar.



Example:

Age of students in class:
12 13 12 11 14 15 12 13 14

Mean=  ( 12 + 13 + 12 + 11 + 14 + 15 + 12 + 13 +14 ) / 9 = 116 / 9 = 12.88
Therefore, the average age is 12.8 ~ 13years


Median


Median is the middle value in an ordered array of the data that has been ranked from smallest to the largest. Half of the values are smaller or equal to the median, and half of the values are larger or equal to the median. In case of extreme values, median can be used as it is not affected.

Median has different calculation in case of odd and even data numbers.


Odd number of values

Median is the middle value of the sorted data

Example:
Age of students in class: 12 13 12 11 14 15 12 13 14

First sort the data : 11 12 12 12 13 13 14 14 15
Therefore, Median is 13


Even number of values

Median is the average of two middle values of the sorted data

Example:
Age of students in class: 12 13 12 11 14 15 12 13 14  12

First sort the data : 11 12 12 12 12 13 13 14 14 15 
Median = (12 + 13) /2 =12.5


Mode


Mode is the value that appears most frequently. Like the median and unlike the mean, it is not affected by the extreme values. For a particular variable, there can be several modes or no modes at all.

Example:
Age of students in class: 12 13 12 11 14 15 12 13 14 13

Since 12 and 13 has occurred 3 times, mode :12 & 13.


Comparison of Mean , Median and Mode


The determination of which average exactly suits for a specific variable depends on many factors. Certainly depends on the data level. Below are the valid averages for each level of data.

Nominal data => Mode
                   Ordinal data   => Mode and Median
                               Interval data  => Mean, Median and Mode 
                              Ratio data      => Mean, Median and Mode

For a symmetrical distribution, all the three measures  mean, median and mode are exactly same in value.
i.e, Mean = Median = Mode

Geometric Mean

When you want to measure the rate of change of the variable over time, you need to use the geometric mean instead of arithmetic mean. It is computed as nth root of product of n values.

Harmonic Mean

It is also a mathematical average. It is the reciprocal of the average of  the reciprocal of the values i.e number of values divided by sum of its reciprocals.

Example:
The time taken by 3 teams for designing a caption are 6,3 and 8min. For computing the average rate of designing the caption is

XH= 3 / (1/6 + 1/3 + 1/8) = 3/(15/24) = 24/5 = 4.8min

Arithmetic mean would be X=6+3+8/3 = 5.6min

Harmonic mean will always be less than arithmetic mean.

Sunday, January 6, 2019

Free SAS Tutorials : Basics of SAS programming

SAS stands for STATISTICAL ANALYSIS SYSTEM


SAS software (Click here for free SAS software access)

SAS has 3 windows:
  1. Editor/code window : A place where we write our codes/programs.
  2. Log window : All the executed SAS codes are printed in this window along with any error or warning messages. A PUT statement can be used to print the output in the log window.
  3. Output window(non default window) : Appears only where there is an output from the sas program mostly whenever a dataset is created(dataset is a sas file which has data in the form of rows(observations) and columns(variables)).
  4. Result window : Place where we can see the result from the sas code which is only viewed but is not saved.

Editor/code window




Log window



Output window



How does a SAS data looks like?


SAS data set contains data values that are organized as a table of observations (rows) and variables (columns) that can be processed by SAS software.

The below image has 4 variables i.e, name, age, height and weight and 3 observations.


SAS Components

  •  Base SAS – Basic procedures and data management 
  •  SAS/STAT – Statistical analysis 
  •  SAS/GRAPH – Graphics and presentation 
  •  SAS/ETS – Econometrics and Time Series Analysis 
  •  SAS/INSIGHT – Data mining 
  •  Enterprise Miner – data mining 
  •  Enterprise Guide - GUI based code editor & project manager 
  •  SAS EBI - Suite of Business Intelligence Applications 

Basic SAS programming guidelines:

  1. Each statement ends with a semicolon.
    • Example: Data sample; Input id;
  2. A single statement can be written in multiple lines or multiple sas statements can be written in single line.
  3. SAS statements are not case sensitive.
  4. Every data step and proc step should end with RUN statement and proc sql should end with QUIT statement
    • Example - data step: Data sample;var1=10;run;
    • Example - proc step: proc print data=sample;run;
    • Example - proc sql;select * from sample;quit;
  5. All the sas variable names should be between 1 to 32 characters.
  6. SAS variable names can start with a underscore (_) or character but cannot start with number. Blank spaces and few special characters like $,%, & are not allowed in the variable names.
  7. SAS code comments are specified in two ways. 
    • commenting starts with  asterisk(*) and ends with semicolon(;) --- * comment ;
      • Example: 
      • * mention the comments;
      • * comment line 1
                              * comment line 2;
    • commenting starts with /* and ends with  */ --- /* comment */
      • Example:
                              /* comment line 1
                                  comment line2
                                  comment line 3 */

SAS Programming:

Every sas code starts with either a DATA step or PROC step

DATA step

  • Data step starts with keyword DATA and ends with RUN statement
  • One of the way to create a dataset
  • Data step is used to get the data from a raw file like excel ,text etc. 
  • It is used to manage the data, create new variables, transform the data.
  • Create variables and observations
Sample code:


Output:



PROC step

  • Proc step starts with keyword PROC and end with RUN statement.
  • Proc steps are nothing but inbuilt procedures available in SAS. 
  • These are used to analyze the data, view the data and generate reports.
Sample code:


Output: