Analysis of Olympic History Data Using SAS Part-01

DEVENDER PALSA
Oct 26, 2022

In this project, we walk through the process of Exploratory Data Analysis (EDA), dig into the data, and try to answer some common interview questions using SAS.
I downloaded the Olympic dataset from Kaggle.
The Olympic Games are considered the world’s foremost sports competition, with more than 200 countries participating. The Olympics are normally held every four years, and since 1994 the Summer and Winter Games have alternated every two years within the four-year cycle.
This dataset contains information about the Olympics from 1896 to 2016. It consists of two files: the athletes file and the region file.
The athletes file contains 271,116 rows and 15 columns. Each row corresponds to an individual athlete competing in a particular Olympic event (athlete events).
The Region file contains 230 rows and 3 columns.
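The steps below assume both files are already available as SAS datasets in the WORK library. As a minimal sketch of one way to load them, assuming they were saved as CSV files: the directory path and the file names athlete_events.csv and noc_regions.csv are assumptions, as is the output name work.noc_regions, so adjust them to your own setup.

```sas
/* Hypothetical location: point this at the folder holding the downloaded CSV files. */
%let data_dir = /home/user/olympics;

/* Athletes file. guessingrows=max scans every row before choosing
   column types and lengths, which helps avoid truncated text values. */
proc import datafile="&data_dir./athlete_events.csv"
    out=work.athlete_events
    dbms=csv
    replace;
    getnames=yes;
    guessingrows=max;
run;

/* Region file (file name and dataset name are assumed). */
proc import datafile="&data_dir./noc_regions.csv"
    out=work.noc_regions
    dbms=csv
    replace;
    getnames=yes;
    guessingrows=max;
run;
```

Depending on how missing values are encoded in the CSV, some numeric columns may be imported as character, so it is worth checking the types in the PROC CONTENTS output before running any statistics.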

What is Exploratory Data Analysis, and what is it used for?
EDA is a critical process in which we get to know our data by summarizing it with statistical tools or by visualizing it with visualization tools.
EDA helps us determine the trends, patterns, variable properties, data types, descriptive statistics, and missing values in our dataset, and also check the uniqueness of the data.

  1. First we will look at the metadata of our dataset. PROC CONTENTS reports the number of observations and variables, and lists each variable with its attributes.

proc contents data=athlete_events;
run;

Dataset Metadata
Variable Metadata

2. Sneak peek at the data (print only the first 5 records).

proc print data=work.athlete_events (obs=5) noobs;
run;

First 5 records

3. Get the descriptive statistics of the data.

proc means data=work.athlete_events mean median mode std var min max;
run;

Descriptive Statistics
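Without a VAR statement, PROC MEANS summarizes every numeric variable, including identifiers for which a mean is not meaningful. A minimal sketch of a more targeted summary follows; Age, Height, and Weight are assumed column names (check them against the PROC CONTENTS listing), while Season comes from the dataset as used later in this article.

```sas
/* Age, Height and Weight are assumed column names; adjust as needed. */
proc means data=work.athlete_events n nmiss mean median std min max maxdec=2;
    class Season;            /* one block of statistics per season */
    var Age Height Weight;   /* restrict the output to these measurements */
run;
```

The CLASS statement splits the statistics by group, and MAXDEC=2 limits the printed values to two decimal places.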

4. Count the missing values in each numeric variable (PROC MEANS only analyzes numeric variables).

proc means data=work.athlete_events nmiss;
run;

Number of missing values
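Because PROC MEANS skips character variables, their missing values are not counted above. One common workaround, shown here as a sketch rather than the only approach, is to map every value to "Missing" or "Present" with a format and let PROC FREQ count the two groups. The format names missfmt and $missfmt are arbitrary names chosen for this example.

```sas
/* Collapse every value into one of two groups. */
proc format;
    value $missfmt ' ' = 'Missing' other = 'Present';   /* character variables */
    value  missfmt  .  = 'Missing' other = 'Present';   /* numeric variables   */
run;

proc freq data=work.athlete_events;
    /* Apply the formats to all character and all numeric variables. */
    format _character_ $missfmt. _numeric_ missfmt.;
    /* MISSING treats the missing group as a regular level in the table. */
    tables _all_ / missing nocum;
run;
```

This only detects true SAS missing values. If the source file encodes missing data as text (for example a literal placeholder string), those values would show up as "Present".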

5. Count the distinct values of a variable in the data.

5.1. Count distinct values using proc sql.

proc sql;
select count(distinct Name) as Name,
count(distinct Team) as Team,
count(distinct Season) as Season
from work.athlete_events ;
quit;

5.2. Count distinct values using proc freq.

proc freq data=work.athlete_events (keep = Name Team Season) nlevels;
tables Name Team Season / nopercent nocol nocum nofreq noprint;
run;
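Both approaches above return only the number of distinct values. To see the values themselves, a sketch along these lines could be used; it works best for low-cardinality variables such as Season, since a variable like Name would produce a very long listing.

```sas
/* List the distinct values of Season with their frequencies,
   most frequent first. */
proc freq data=work.athlete_events order=freq;
    tables Season / nocum;
run;

/* The same list of values, without counts. */
proc sql;
    select distinct Season
    from work.athlete_events
    order by Season;
quit;
```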

6. Check the distribution of the data with univariate statistical analysis and a histogram.

PROC UNIVARIATE gives us an overall picture of the data: Moments, Basic Statistical Measures, Tests for Location, Quantiles, and Extreme Observations.

proc univariate data=work.athlete_events novarcontents;
histogram Year ;
run;

Example using sashelp.shoes

proc univariate data=sashelp.shoes NOPRINT;
histogram sales / NORMAL;
run;

Datasets: https://github.com/devenderpalsa/Olympic-History-Data

