Pages

Wednesday, February 13, 2019

Free SAS Tutorials: Dataset creation


Dataset

SAS dataset is generally in the form of table, i.e rows and columns. In SAS, rows are called observations and columns as variables.

Dataset creation

Below statements are used for creating a dataset:

  • DATA -  The DATA step always begins with a DATA statement. The purpose of the DATA statement is to tell SAS that you are creating a new data set i.e. outdata.
  • INPUT -  To define the variables used in data set.
  • Dollar sign ($) - To identify variable as character.
  • DATALINES - To indicate that lines following DATALINES statement a real data.
  • RUN - The DATA step ends with a RUN statement.


Example 1:


Data datasetname;
     input var1 var2 $;
     datalines;
      1 anil
      2 raj
      3 ravi
      4 neetu
      ;
run; 

Code:

Output:


Example 2:


data Test;    
input Item $ 1-6 Color $ 8-14 Investment 16-22 Profit 24-31;    
format Item Color $9. Investment  Profit  15.2;     
datalines; 
SHIRTS ORANGE  2431354 83952431 
TIES   BLUE    498432  2349123 
SUITS  BLUE    9482121 69839123 
BELTS  MAGENTA 7693    14893 
SHOES  MAGENTA 7936712 22956 
run;

The highlighted input statement is another way of creating the variables by mentioning the position numbers.

Code:



Output:





Datasets are of 2 types: Temporary and Permanent

Temporary dataset:

These datasets are not available after the session is closed or ended. Usually they are saved in work library(a storage location where sas datasets are saved). 

The below 2 statements are equivalent.

data work.sample;
or
data sample;

Permanent dataset:

These datasets are saved in a location which can be used even after closing the sas session. Usually, these datasets are referenced as libref.dataset_name. Libref is a name that is temporarily assigned to a library. Library is storage where all the sas datasets are saved. 

A LIBNAME statement associates the libref with the SAS library. In the following PROC PRINT step, PROCLIB is the library reference and EMP is the SAS data set within the library:
libname proclib 'SAS-library';
proc print data=proclib.emp;
run;

Dataset creation using PROC SQL:


Proc SQL can be used for creating a dataset by connecting to Oracle database or from existing dataset.

Example: Creation of dataset sample from existing dataset i.e sashelp.cars

Proc sql;
create table sample as
select * from sashelp.cars;
quit;

Tuesday, January 22, 2019

SAS Example : Split dataset into n equally sized datasets


Considering the cars dataset available in sashelp and splitting it to 10 equally sized datasets.


Step1:

Creating a row number using _n_ system variable option.
           
               data sample;
                   set sashelp.cars;
                        n=_n_;
               run;


Step2:

The below macro splits the data into n equally sized datasets. n is passed as a parameter to the macro.
  • Macro name : split
  • creating a cnt macro variable which holds the no. of observation count of the cars dataset. Cars dataset has 428 observations, therefore cnt = 428 
  • number macro variable holds the no. of observations to be kept in each split dataset. In the below macro, cars dataset is split into 10 datasets. therefore number = 428/10 ~ 43 observations 
  • DO loop to iterate 10 times to split the cars dataset into 10 equally sized datasets. The split datasets are named as sample_1 to sample_10.

%macro split(divide);

         proc sql;
               select max(n) into :cnt from sample;
         quit;

         data split;
               num=ceil(&cnt/&divide);
         run;

         proc sql;
               select num into :number from split;
         quit;

         %do i=1 %to ÷

                data sample_&i;
                     set sample;
                          if n>=((&i-1)*&number)+1 and n<=(&i*&number);
                run;

        %end;

%mend split;

%split(10);

Code:



Output:



Thursday, January 17, 2019

Statistics : Basics - Mean , Median and Mode

Arithmetic mean/Mean


It is the most common measure of central tendency. It serves as a balance point in a set of data. It is the only common measure in which all the values play an equal role. Mean is computed as sum of all the values divided by number of values in the data set. It is denoted by X-bar.



Example:

Age of students in class:
12 13 12 11 14 15 12 13 14

Mean=  ( 12 + 13 + 12 + 11 + 14 + 15 + 12 + 13 +14 ) / 9 = 116 / 9 = 12.88
Therefore, the average age is 12.8 ~ 13years


Median


Median is the middle value in an ordered array of the data that has been ranked from smallest to the largest. Half of the values are smaller or equal to the median, and half of the values are larger or equal to the median. In case of extreme values, median can be used as it is not affected.

Median has different calculation in case of odd and even data numbers.


Odd number of values

Median is the middle value of the sorted data

Example:
Age of students in class: 12 13 12 11 14 15 12 13 14

First sort the data : 11 12 12 12 13 13 14 14 15
Therefore, Median is 13


Even number of values

Median is the average of two middle values of the sorted data

Example:
Age of students in class: 12 13 12 11 14 15 12 13 14  12

First sort the data : 11 12 12 12 12 13 13 14 14 15 
Median = (12 + 13) /2 =12.5


Mode


Mode is the value that appears most frequently. Like the median and unlike the mean, it is not affected by the extreme values. For a particular variable, there can be several modes or no modes at all.

Example:
Age of students in class: 12 13 12 11 14 15 12 13 14 13

Since 12 and 13 has occurred 3 times, mode :12 & 13.


Comparison of Mean , Median and Mode


The determination of which average exactly suits for a specific variable depends on many factors. Certainly depends on the data level. Below are the valid averages for each level of data.

Nominal data => Mode
                   Ordinal data   => Mode and Median
                               Interval data  => Mean, Median and Mode 
                              Ratio data      => Mean, Median and Mode

For a symmetrical distribution, all the three measures  mean, median and mode are exactly same in value.
i.e, Mean = Median = Mode

Geometric Mean

When you want to measure the rate of change of the variable over time, you need to use the geometric mean instead of arithmetic mean. It is computed as nth root of product of n values.

Harmonic Mean

It is also a mathematical average. It is the reciprocal of the average of  the reciprocal of the values i.e number of values divided by sum of its reciprocals.

Example:
The time taken by 3 teams for designing a caption are 6,3 and 8min. For computing the average rate of designing the caption is

XH= 3 / (1/6 + 1/3 + 1/8) = 3/(15/24) = 24/5 = 4.8min

Arithmetic mean would be X=6+3+8/3 = 5.6min

Harmonic mean will always be less than arithmetic mean.

Sunday, January 6, 2019

Free SAS Tutorials : Basics of SAS programming

SAS stands for STATISTICAL ANALYSIS SYSTEM


SAS software (Click here for free SAS software access)

SAS has 3 windows:
  1. Editor/code window : A place where we write our codes/programs.
  2. Log window : All the executed SAS codes are printed in this window along with any error or warning messages. A PUT statement can be used to print the output in the log window.
  3. Output window(non default window) : Appears only where there is an output from the sas program mostly whenever a dataset is created(dataset is a sas file which has data in the form of rows(observations) and columns(variables)).
  4. Result window : Place where we can see the result from the sas code which is only viewed but is not saved.

Editor/code window




Log window



Output window



How does a SAS data looks like?


SAS data set contains data values that are organized as a table of observations (rows) and variables (columns) that can be processed by SAS software.

The below image has 4 variables i.e, name, age, height and weight and 3 observations.


SAS Components

  •  Base SAS – Basic procedures and data management 
  •  SAS/STAT – Statistical analysis 
  •  SAS/GRAPH – Graphics and presentation 
  •  SAS/ETS – Econometrics and Time Series Analysis 
  •  SAS/INSIGHT – Data mining 
  •  Enterprise Miner – data mining 
  •  Enterprise Guide - GUI based code editor & project manager 
  •  SAS EBI - Suite of Business Intelligence Applications 

Basic SAS programming guidelines:

  1. Each statement ends with a semicolon.
    • Example: Data sample; Input id;
  2. A single statement can be written in multiple lines or multiple sas statements can be written in single line.
  3. SAS statements are not case sensitive.
  4. Every data step and proc step should end with RUN statement and proc sql should end with QUIT statement
    • Example - data step: Data sample;var1=10;run;
    • Example - proc step: proc print data=sample;run;
    • Example - proc sql;select * from sample;quit;
  5. All the sas variable names should be between 1 to 32 characters.
  6. SAS variable names can start with a underscore (_) or character but cannot start with number. Blank spaces and few special characters like $,%, & are not allowed in the variable names.
  7. SAS code comments are specified in two ways. 
    • commenting starts with  asterisk(*) and ends with semicolon(;) --- * comment ;
      • Example: 
      • * mention the comments;
      • * comment line 1
                              * comment line 2;
    • commenting starts with /* and ends with  */ --- /* comment */
      • Example:
                              /* comment line 1
                                  comment line2
                                  comment line 3 */

SAS Programming:

Every sas code starts with either a DATA step or PROC step

DATA step

  • Data step starts with keyword DATA and ends with RUN statement
  • One of the way to create a dataset
  • Data step is used to get the data from a raw file like excel ,text etc. 
  • It is used to manage the data, create new variables, transform the data.
  • Create variables and observations
Sample code:


Output:



PROC step

  • Proc step starts with keyword PROC and end with RUN statement.
  • Proc steps are nothing but inbuilt procedures available in SAS. 
  • These are used to analyze the data, view the data and generate reports.
Sample code:


Output:




Tuesday, December 18, 2018

Free SAS tutorials: SAS software without installation

SAS OnDemand for Academics is the free software and no installation is required. Internet is required as it runs on cloud and can be accessed for anywhere.


Steps for accessing SAS OnDemand for Academics:


  1. Register yourself by clicking on the link: SAS Registration
  2. Enter your First Name, Last Name, and Email Address in the form. Select the Country in which you reside. Click Submit.
  3. You will receive an email from SAS team for activation of your profile
  4. Clink on the link and set your password by mentioning your email id
  5. You will receive another email with your user id and the link to access SAS Studio

Click on the link to sign in: Sign In to SAS

Once you login the below screen appears:


Advantages of the software:

  • Its free
  • Can be accessed for everywhere
  • No installation is required

The following products are available on the software:

  • Base SAS
  • SAS/STAT software 
  • SAS/GRAPH software
  • SAS/ETS software 
  • SAS/OR software, including OPT, PRS, IVS, and LSO 
  • SAS/IML software 
  • SAS/CONNECT 
  • SAS High-Performance Forecasting 
  • SAS/ACCESS Interface to PC Files 
  • SAS/QC software

Browsers supported:

  • Microsoft Internet Explorer 9, 10, and 11 (Microsoft Internet Explorer 10 or later is recommended for certain features, such as the ability to drag and drop files) 
  • Mozilla Firefox 21 or later 
  • Google Chrome 27 or later 
  • Apple Safari 6.0 or later (on Mac OS X) 


Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.


Subscribe for Email to get free updates on Data Science.

Monday, December 17, 2018

Data Science summarized in ONE picture




Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.


Subscribe for Email to get free updates on Data Science.

Wednesday, December 12, 2018

Case Study : CHICAGO CRIME DATA ANALYSIS

Problem Statement:  

The below location has historical crime data in the state of Chicago.   The data has different co-ordinates like location,date/time,address,zipcode etc. The requirement here is to provide visual insights dashboard from the existing data for better prevention and policing. You can use any visualization tool as you wish.

Solution:

Business Analysis:

There are many visualization tools available in the market. I have utilized Tableau for the visualisation. For insights into this data, I have done the below visualisation charts.  
    • Crimes Trending Analysis : The goal is to understand the period during which there are more crimes than the rest of periods.
    • Type and Location Trends :  Top 5 locations with more crimes. 
    • Location Analysis :   Sub-location crime analysis. Goal is to understand where exactly the police have to pay attention to during patroling.
    • Type Analysis : Different Types of crimes at different locations.
    • Geo Analysis:  This is a good geo chart with spots to identify the location analysis.
    • Zipcode Analysis: Crime Analysis by zipcode.
    • Community Area Analysis : Identify top few communities with crime rate.
    • Arrest Analysis: Provide insights where arrests are happening. The police can compare arrests analysis vs geo analysis and streamline their operations accordingly.

 Visualisation Outputs:












From the graph we can understand that the Theft is happening mostly on the Streets and Battery on Sidewalk, Residence and apartments etc



















Summary:


From the above graphs, below insights can be drawn:
  •      Most of the crimes are happening in the mid of the year i.e. May, June, July, August and during the evening hours which is after 6PM and also observed crimes happening during noon.
  •     60% of the crimes are happening in these locations: Streets, Residence, Apartment and Sidewalk.
  •      60% of the crimes are theft, battery, criminal damage, narcotics
  •      Need to concentrate on the beats per each zipcode or district
  •      Police to concentrate more on the arrest performance and also to find new techniques to arrest criminals. Only 26% of the criminals are arrested in 2015

Where should Chicago Police Focus?

  • Type Vs Location focus?
    • Street <=> Theft
    • Apartment <=> Battery
    • Sidewalk <=> Narcotics
  • Which places to focus?
    • Zipcode: 0000X, 001XX, 002XX
    • Community: 25
  • When to have increased Policing
    • May - October :12:00 PM, 18:00  - 21:00
    • Arrests : Focus on all types excluding Narcotics


      Statistical Modeling ideas:

      •        Considering the past/historic data i.e. 15years to predict the crime in future.


      •        Few of the advanced analytics we can perform are :
        •        Clustering,
        •        Hypothesis testing
        •        Trending analysis
        •        Pattern Detection


      •        Identifying the trends in the crimes can help police monitor on those areas.


      •        Logistic regression can be used to predict if a person is criminal or not.


      •        Pattern analysis can help in identifying the patterns followed by criminals.


      •        Model can be built to predict the occurrence of crime.


      •        Data required for more analysis:
        •        Information on criminal age and background, gender, race can help in identifying the crime trends.
        •        Census data to understand the population by area. Using this data we can understand the crime and population relation.


      Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.


      Subscribe for Email to get free updates on Data Science.