ANYTHING AND EVERYTHING ABOUT DATA SCIENCE

Wednesday, February 13, 2019

Free SAS Tutorials: Dataset creation

Dataset

SAS dataset is generally in the form of table, i.e rows and columns. In SAS, rows are called observations and columns as variables.

Dataset creation

Below statements are used for creating a dataset:

DATA - The DATA step always begins with a DATA statement. The purpose of the DATA statement is to tell SAS that you are creating a new data set i.e. outdata.
INPUT - To define the variables used in data set.
Dollar sign ($) - To identify variable as character.
DATALINES - To indicate that lines following DATALINES statement a real data.
RUN - The DATA step ends with a RUN statement.

Example 1:

Data datasetname;

input var1 var2 $;

datalines;

1 anil

2 raj

3 ravi

4 neetu

;

run;

Code:

Output:

Example 2:

data Test;

input Item $ 1-6 Color $ 8-14 Investment 16-22 Profit 24-31;

format Item Color $9. Investment Profit 15.2;

datalines;

SHIRTS ORANGE 2431354 83952431

TIES BLUE 498432 2349123

SUITS BLUE 9482121 69839123

BELTS MAGENTA 7693 14893

SHOES MAGENTA 7936712 22956

;

run;

The highlighted input statement is another way of creating the variables by mentioning the position numbers.

Code:

Output:

Datasets are of 2 types: Temporary and Permanent

Temporary dataset:

These datasets are not available after the session is closed or ended. Usually they are saved in work library(a storage location where sas datasets are saved).

The below 2 statements are equivalent.

data work.sample;

data sample;

Permanent dataset:

These datasets are saved in a location which can be used even after closing the sas session. Usually, these datasets are referenced as libref.dataset_name. Libref is a name that is temporarily assigned to a library. Library is storage where all the sas datasets are saved.

A LIBNAME statement associates the libref with the SAS library. In the following PROC PRINT step, PROCLIB is the library reference and EMP is the SAS data set within the library:

libname proclib 'SAS-library';
proc print data=proclib.emp;
run;

Dataset creation using PROC SQL:

Proc SQL can be used for creating a dataset by connecting to Oracle database or from existing dataset.

Example: Creation of dataset sample from existing dataset i.e sashelp.cars

Proc sql;

create table sample as

select * from sashelp.cars;

quit;

Tuesday, January 22, 2019

SAS Example : Split dataset into n equally sized datasets

Considering the cars dataset available in sashelp and splitting it to 10 equally sized datasets.

Step1:

Creating a row number using _n_ system variable option.

data sample;
set sashelp.cars;
n=_n_;
run;

Step2:

The below macro splits the data into n equally sized datasets. n is passed as a parameter to the macro.

Macro name : split

creating a cnt macro variable which holds the no. of observation count of the cars dataset. Cars dataset has 428 observations, therefore cnt = 428

number macro variable holds the no. of observations to be kept in each split dataset. In the below macro, cars dataset is split into 10 datasets. therefore number = 428/10 ~ 43 observations

DO loop to iterate 10 times to split the cars dataset into 10 equally sized datasets. The split datasets are named as sample_1 to sample_10.

%macro split(divide);

proc sql;
select max(n) into :cnt from sample;
quit;

data split;
num=ceil(&cnt/&divide);
run;

proc sql;
select num into :number from split;
quit;

%do i=1 %to ÷

data sample_&i;
set sample;
if n>=((&i-1)*&number)+1 and n<=(&i*&number);
run;

%end;

%mend split;

%split(10);

Code:

Output:

Thursday, January 17, 2019

Statistics : Basics - Mean , Median and Mode

Arithmetic mean/Mean

It is the most common measure of central tendency. It serves as a balance point in a set of data. It is the only common measure in which all the values play an equal role. Mean is computed as sum of all the values divided by number of values in the data set. It is denoted by X-bar.

Example:

Age of students in class:

12 13 12 11 14 15 12 13 14

Mean= ( 12 + 13 + 12 + 11 + 14 + 15 + 12 + 13 +14 ) / 9 = 116 / 9 = 12.88

Therefore, the average age is 12.8 ~ 13years

Median

Median is the middle value in an ordered array of the data that has been ranked from smallest to the largest. Half of the values are smaller or equal to the median, and half of the values are larger or equal to the median. In case of extreme values, median can be used as it is not affected.

Median has different calculation in case of odd and even data numbers.

Odd number of values

Median is the middle value of the sorted data

Example:

Age of students in class: 12 13 12 11 14 15 12 13 14

First sort the data : 11 12 12 12 13 13 14 14 15

Therefore, Median is 13

Even number of values

Median is the average of two middle values of the sorted data

Example:

Age of students in class: 12 13 12 11 14 15 12 13 14 12

First sort the data : 11 12 12 12 12 13 13 14 14 15

Median = (12 + 13) /2 =12.5

Mode

Mode is the value that appears most frequently. Like the median and unlike the mean, it is not affected by the extreme values. For a particular variable, there can be several modes or no modes at all.

Example:

Age of students in class: 12 13 12 11 14 15 12 13 14 13

Since 12 and 13 has occurred 3 times, mode :12 & 13.

Comparison of Mean , Median and Mode

The determination of which average exactly suits for a specific variable depends on many factors. Certainly depends on the data level. Below are the valid averages for each level of data.

Nominal data => Mode

Ordinal data => Mode and Median

Interval data => Mean, Median and Mode

Ratio data => Mean, Median and Mode

For a symmetrical distribution, all the three measures mean, median and mode are exactly same in value.

i.e, Mean = Median = Mode

Geometric Mean

When you want to measure the rate of change of the variable over time, you need to use the geometric mean instead of arithmetic mean. It is computed as nth root of product of n values.

Harmonic Mean

It is also a mathematical average. It is the reciprocal of the average of the reciprocal of the values i.e number of values divided by sum of its reciprocals.

Example:

The time taken by 3 teams for designing a caption are 6,3 and 8min. For computing the average rate of designing the caption is

XH= 3 / (1/6 + 1/3 + 1/8) = 3/(15/24) = 24/5 = 4.8min

Arithmetic mean would be X=6+3+8/3 = 5.6min

Harmonic mean will always be less than arithmetic mean.

Sunday, January 6, 2019

Free SAS Tutorials : Basics of SAS programming

SAS stands for STATISTICAL ANALYSIS SYSTEM

SAS software (Click here for free SAS software access)

SAS has 3 windows:

Editor/code window : A place where we write our codes/programs.
Log window : All the executed SAS codes are printed in this window along with any error or warning messages. A PUT statement can be used to print the output in the log window.
Output window(non default window) : Appears only where there is an output from the sas program mostly whenever a dataset is created(dataset is a sas file which has data in the form of rows(observations) and columns(variables)).
Result window : Place where we can see the result from the sas code which is only viewed but is not saved.

Editor/code window

Log window

Output window

How does a SAS data looks like?

A SAS data set contains data values that are organized as a table of observations (rows) and variables (columns) that can be processed by SAS software.

The below image has 4 variables i.e, name, age, height and weight and 3 observations.

SAS Components

Base SAS – Basic procedures and data management
SAS/STAT – Statistical analysis
SAS/GRAPH – Graphics and presentation
SAS/ETS – Econometrics and Time Series Analysis
SAS/INSIGHT – Data mining
Enterprise Miner – data mining
Enterprise Guide - GUI based code editor & project manager
SAS EBI - Suite of Business Intelligence Applications

Basic SAS programming guidelines:

Each statement ends with a semicolon.

Example: Data sample; Input id;

A single statement can be written in multiple lines or multiple sas statements can be written in single line.
SAS statements are not case sensitive.
Every data step and proc step should end with RUN statement and proc sql should end with QUIT statement

Example - data step: Data sample;var1=10;run;
Example - proc step: proc print data=sample;run;
Example - proc sql;select * from sample;quit;

All the sas variable names should be between 1 to 32 characters.
SAS variable names can start with a underscore (_) or character but cannot start with number. Blank spaces and few special characters like $,%, & are not allowed in the variable names.
SAS code comments are specified in two ways.

commenting starts with asterisk(*) and ends with semicolon(;) --- * comment ;

Example:
* mention the comments;
* comment line 1

* comment line 2;

commenting starts with /* and ends with */ --- /* comment */

Example:

/* comment line 1
comment line2

comment line 3 */

SAS Programming:

Every sas code starts with either a DATA step or PROC step

DATA step

Data step starts with keyword DATA and ends with RUN statement
One of the way to create a dataset
Data step is used to get the data from a raw file like excel ,text etc.
It is used to manage the data, create new variables, transform the data.
Create variables and observations

Sample code:

Output:

PROC step

Proc step starts with keyword PROC and end with RUN statement.
Proc steps are nothing but inbuilt procedures available in SAS.
These are used to analyze the data, view the data and generate reports.

Sample code:

Output:

<- Previous

Tuesday, December 18, 2018

Free SAS tutorials: SAS software without installation

SAS OnDemand for Academics is the free software and no installation is required. Internet is required as it runs on cloud and can be accessed for anywhere.

Steps for accessing SAS OnDemand for Academics:

Register yourself by clicking on the link: SAS Registration
Enter your First Name, Last Name, and Email Address in the form. Select the Country in which you reside. Click Submit.
You will receive an email from SAS team for activation of your profile
Clink on the link and set your password by mentioning your email id
You will receive another email with your user id and the link to access SAS Studio

Click on the link to sign in: Sign In to SAS

Once you login the below screen appears:

Advantages of the software:

Its free
Can be accessed for everywhere
No installation is required

The following products are available on the software:

Base SAS
SAS/STAT software
SAS/GRAPH software
SAS/ETS software
SAS/OR software, including OPT, PRS, IVS, and LSO
SAS/IML software
SAS/CONNECT
SAS High-Performance Forecasting
SAS/ACCESS Interface to PC Files
SAS/QC software

Browsers supported:

Microsoft Internet Explorer 9, 10, and 11 (Microsoft Internet Explorer 10 or later is recommended for certain features, such as the ability to drag and drop files)
Mozilla Firefox 21 or later
Google Chrome 27 or later
Apple Safari 6.0 or later (on Mac OS X)

Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.

Subscribe for Email to get free updates on Data Science.

Monday, December 17, 2018

Data Science summarized in ONE picture

Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.

Subscribe for Email to get free updates on Data Science.

Wednesday, December 12, 2018

Case Study : CHICAGO CRIME DATA ANALYSIS

Problem Statement:

The below location has historical crime data in the state of Chicago. The data has different co-ordinates like location,date/time,address,zipcode etc. The requirement here is to provide visual insights dashboard from the existing data for better prevention and policing. You can use any visualization tool as you wish.

https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9

Solution:

Business Analysis:

There are many visualization tools available in the market. I have utilized Tableau for the visualisation. For insights into this data, I have done the below visualisation charts.

Crimes Trending Analysis : The goal is to understand the period during which there are more crimes than the rest of periods.
Type and Location Trends : Top 5 locations with more crimes.
Location Analysis : Sub-location crime analysis. Goal is to understand where exactly the police have to pay attention to during patroling.
Type Analysis : Different Types of crimes at different locations.
Geo Analysis: This is a good geo chart with spots to identify the location analysis.
Zipcode Analysis: Crime Analysis by zipcode.
Community Area Analysis : Identify top few communities with crime rate.
Arrest Analysis: Provide insights where arrests are happening. The police can compare arrests analysis vs geo analysis and streamline their operations accordingly.

Visualisation Outputs:

From the graph we can understand that the Theft is happening mostly on the Streets and Battery on Sidewalk, Residence and apartments etc

Summary:

From the above graphs, below insights can be drawn:

Most of the crimes are happening in the mid of the year i.e. May, June, July, August and during the evening hours which is after 6PM and also observed crimes happening during noon.

60% of the crimes are happening in these locations: Streets, Residence, Apartment and Sidewalk.

60% of the crimes are theft, battery, criminal damage, narcotics

Need to concentrate on the beats per each zipcode or district

Police to concentrate more on the arrest performance and also to find new techniques to arrest criminals. Only 26% of the criminals are arrested in 2015

Where should Chicago Police Focus?

Type Vs Location focus?

Street <=> Theft
Apartment <=> Battery
Sidewalk <=> Narcotics

Which places to focus?

Zipcode: 0000X, 001XX, 002XX
Community: 25

When to have increased Policing

May - October :12:00 PM, 18:00 - 21:00

Arrests : Focus on all types excluding Narcotics

Statistical Modeling ideas:

Considering the past/historic data i.e. 15years to predict the crime in future.

Few of the advanced analytics we can perform are :

Clustering,
Hypothesis testing
Trending analysis
Pattern Detection

Identifying the trends in the crimes can help police monitor on those areas.

Logistic regression can be used to predict if a person is criminal or not.

Pattern analysis can help in identifying the patterns followed by criminals.

Model can be built to predict the occurrence of crime.

Data required for more analysis:

Information on criminal age and background, gender, race can help in identifying the crime trends.

Census data to understand the population by area. Using this data we can understand the crime and population relation.

Pages

Wednesday, February 13, 2019

Dataset

Dataset creation

Example 1:

Code:

Output:

Example 2:

Code:

Output:

Temporary dataset:

Permanent dataset:

Dataset creation using PROC SQL:

Tuesday, January 22, 2019

Considering the cars dataset available in sashelp and splitting it to 10 equally sized datasets.

Step1:

Step2:

Code:

Output:

Thursday, January 17, 2019

Arithmetic mean/Mean

Median

Odd number of values

Even number of values

Mode

Comparison of Mean , Median and Mode

Geometric Mean

Harmonic Mean

Sunday, January 6, 2019

SAS stands for STATISTICAL ANALYSIS SYSTEM

SAS software (Click here for free SAS software access)

Editor/code window

Log window

Output window

How does a SAS data looks like?

SAS Components

Basic SAS programming guidelines:

SAS Programming:

DATA step

PROC step

Tuesday, December 18, 2018

Steps for accessing SAS OnDemand for Academics:

Click on the link to sign in: Sign In to SAS

Once you login the below screen appears:

Advantages of the software:

The following products are available on the software:

Browsers supported:

Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.

Subscribe for Email to get free updates on Data Science.

Monday, December 17, 2018

Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.

Subscribe for Email to get free updates on Data Science.

Wednesday, December 12, 2018

Problem Statement:

https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9 Solution: Business Analysis:

Solution:

Business Analysis:

Visualisation Outputs:

Summary:

Where should Chicago Police Focus?

Statistical Modeling ideas:

Please feel free to drop your comments. If you need 1-1 online training on data science, SAS & SQL, drop a comment below.

Subscribe for Email to get free updates on Data Science.

https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9

Solution:

Business Analysis: