How To Get Measures Of Spread With Python

Measures of spread describe how spread out the data points are. Examples of measures of spread include quantiles, variance, standard deviation and mean absolute deviation.

In this exercise we are going to compute these measures of spread using Python.

We will use a dataset from Kaggle; follow https://www.kaggle.com/datasets/himanshunakrani/student-study-hours to access the data.

  1. Quantiles

    Quantiles are values that split sorted data or a probability distribution into equal parts. There are several different types of quantiles; here are some examples:

    • Quartiles - Divide the data into 4 equal parts.
    • Quintiles - Divide the data into 5 equal parts.
    • Deciles - Divide the data into 10 equal parts.
    • Percentiles - Divide the data into 100 equal parts.
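
To make the definition concrete, here is a minimal sketch that finds a quantile by hand. It assumes linear interpolation between neighbouring sorted values, which is the default behaviour of np.quantile( ). The helper quantile_by_hand and the sample values are hypothetical and are not part of the Kaggle dataset.

```python
import numpy as np

def quantile_by_hand(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    data = sorted(values)
    pos = q * (len(data) - 1)      # fractional position in the sorted data
    lower = int(np.floor(pos))     # index of the value just below the position
    upper = int(np.ceil(pos))      # index of the value just above the position
    frac = pos - lower             # how far the position is between the two
    return data[lower] + frac * (data[upper] - data[lower])

sample = [12, 7, 3, 18, 25, 9]     # hypothetical values
by_hand = [quantile_by_hand(sample, q) for q in (0.25, 0.5, 0.75)]
with_numpy = np.quantile(sample, [0.25, 0.5, 0.75])

print(by_hand)
print(with_numpy)
```

Both lines should print the same three quartiles, because the helper follows the same rule as NumPy's default.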

Let us import the libraries we will use.

In [1]:
import numpy as np
import pandas as pd

We will now load the data that we'll use.

In [2]:
df = pd.read_csv('score.csv')
print(df.head())
   Hours  Scores
0    2.5      21
1    5.1      47
2    3.2      27
3    8.5      75
4    3.5      30

Let's calculate the quartiles for the scores. The 5 values returned are the minimum, the three quartiles and the maximum; together they divide the sorted scores into 4 equal parts.

In [3]:
print(np.quantile(df['Scores'], [0, 0.25, 0.5, 0.75, 1]))
[17. 30. 47. 75. 95.]

Quantiles using linspace( )

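Listing every point by hand becomes tedious, especially for higher quantiles such as deciles and percentiles. In such cases we can use np.linspace( ), which returns evenly spaced points between a start and an end value. For example, np.linspace(0, 1, 5) returns the five points 0, 0.25, 0.5, 0.75 and 1.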

Let's get the quartiles of the scores

In [4]:
print(np.quantile(df['Scores'], np.linspace(0, 1, 5)))
[17. 30. 47. 75. 95.]

Let's get the quintiles

In [5]:
print(np.quantile(df['Scores'], np.linspace(0, 1, 6)))
[17.  26.6 38.6 60.8 77.  95. ]

Let's get the deciles

In [6]:
print(np.quantile(df['Scores'], np.linspace(0, 1, 11)))
[17.  22.2 26.6 30.  38.6 47.  60.8 68.6 77.  85.6 95. ]
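
Percentiles follow the same pattern with 101 evenly spaced points. The sketch below is only an illustration: it uses the five scores printed by df.head( ) rather than the full dataset, and it also shows np.percentile( ), which takes the points on a 0 to 100 scale instead of 0 to 1.

```python
import numpy as np

scores = np.array([21, 47, 27, 75, 30])   # the five scores shown by df.head()

# 101 points from 0 to 1 give the 0th, 1st, ..., 100th percentiles
percentiles = np.quantile(scores, np.linspace(0, 1, 101))

# np.percentile takes the same points on a 0-100 scale
same_percentiles = np.percentile(scores, np.linspace(0, 100, 101))

print(len(percentiles))     # number of values returned
print(percentiles[50])      # the 50th percentile, i.e. the median
```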

Interquartile Range (IQR)

This is the difference between the 3rd and the 1st quartile. The IQR measures the spread of the middle half of the data.

Let's get the IQR for the scores

In [7]:
IQR = np.quantile(df['Scores'], 0.75) - np.quantile(df['Scores'], 0.25)
print(IQR)
45.0

Another way to get the IQR is by using iqr( ) from the SciPy library.

In [8]:
from scipy.stats import iqr

IQR = iqr(df['Scores'])
print(IQR)
45.0
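
If only pandas is available, the same calculation can be sketched with the quantile( ) method of a Series. This example assumes just the five scores printed by df.head( ), so its result differs from the 45.0 obtained on the full dataset. The name iqr_value is chosen so that it does not overwrite the imported iqr( ) function.

```python
import pandas as pd

scores = pd.Series([21, 47, 27, 75, 30], name='Scores')  # scores shown by df.head()

q1 = scores.quantile(0.25)    # 1st quartile
q3 = scores.quantile(0.75)    # 3rd quartile
iqr_value = q3 - q1           # spread of the middle half of the data

print(q1, q3, iqr_value)
```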

Outliers

Outliers are data points that differ substantially from the rest of the data points.

A data point is an outlier if:

  • data < 1st quartile − 1.5 * IQR, or
  • data > 3rd quartile + 1.5 * IQR
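
The rule above can be sketched as a small function. The helper name find_outliers and the data are hypothetical: the list reuses the five scores printed by df.head( ) and adds an invented extreme value of 250 so that the rule has something to flag.

```python
import numpy as np

def find_outliers(values):
    """Return the values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    spread = q3 - q1                   # the IQR
    lower = q1 - 1.5 * spread          # lower threshold
    upper = q3 + 1.5 * spread          # upper threshold
    return values[(values < lower) | (values > upper)]

hypothetical_scores = [21, 47, 27, 75, 30, 250]   # 250 is an invented extreme value
print(find_outliers(hypothetical_scores))
```

Only the invented value falls outside the thresholds, so it is the only one returned.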

Let's get the outliers in the scores

In [9]:
# first get the IQR (named iqr_scores so the imported iqr() function is not overwritten)
iqr_scores = iqr(df['Scores'])
# then get the lower & upper thresholds
lower_threshold = np.quantile(df['Scores'], 0.25) - 1.5 * iqr_scores
upper_threshold = np.quantile(df['Scores'], 0.75) + 1.5 * iqr_scores
# then find the outliers
outliers = df[(df['Scores'] < lower_threshold) | (df['Scores'] > upper_threshold)]
print(outliers)
Empty DataFrame
Columns: [Hours, Scores]
Index: []

Both thresholds (-37.5 and 142.5) lie beyond the smallest and largest scores (17 and 95), so the scores contain no outliers.

  2. Variance

Variance is the average of the squared distances between each data point and the mean of the data.

Let's calculate the variance of the scores. We will use np.var( )

In [10]:
print(np.var(df['Scores'], ddof=1))
639.4266666666666

With 'ddof=1' included, the sum of squared distances is divided by n − 1 instead of n, so the result is the sample variance. If it is left out, we get the population variance.

Let's see that here below.

In [11]:
print(np.var(df['Scores']))
613.8496
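
To see what ddof changes, the two variances can be worked out directly from the definition. This is a sketch on the five scores printed by df.head( ), not on the full dataset, so the numbers differ from the ones above.

```python
import numpy as np

scores = np.array([21, 47, 27, 75, 30])   # the five scores shown by df.head()

# squared distance of every score from the mean
squared_dists = (scores - scores.mean()) ** 2
n = len(scores)

population_variance = squared_dists.sum() / n        # divide by n
sample_variance = squared_dists.sum() / (n - 1)      # divide by n - 1

print(population_variance, np.var(scores))           # both use ddof=0
print(sample_variance, np.var(scores, ddof=1))       # both use ddof=1
```

Each line prints the same value twice, which confirms that ddof is simply the number subtracted from n in the denominator.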

  3. Standard deviation

This is the square root of the variance.

Let's get the standard deviation of the scores

In [12]:
print(np.sqrt(np.var(df['Scores'], ddof=1)))
25.28688724747802

Another way we can get standard deviation is by np.std( )

Let's use that

In [13]:
print(np.std(df['Scores'], ddof=1))
25.28688724747802
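
One detail worth checking is the default value of ddof, which differs between libraries: np.std( ) uses ddof=0 while the std( ) method of a pandas Series uses ddof=1. The sketch below illustrates this on the five scores printed by df.head( ); the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.Series([21, 47, 27, 75, 30])   # the five scores shown by df.head()

pandas_default = scores.std()                       # pandas default: ddof=1 (sample)
numpy_default = np.std(scores.to_numpy())           # NumPy default: ddof=0 (population)
numpy_sample = np.std(scores.to_numpy(), ddof=1)    # matches the pandas default

print(pandas_default, numpy_sample)
print(numpy_default)
```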

  4. Mean Absolute Deviation

This is the average of the absolute distances between each data point and the mean of the data.

Let's find the mean absolute deviation of the scores

In [14]:
# first find the distance between the data points and the mean
dists = df['Scores'] - np.mean(df['Scores'])
# then find the mean of the absolute distances
print(np.mean(np.abs(dists)))
22.4192
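
The two steps can be wrapped in a small reusable function. The helper name mean_absolute_deviation is hypothetical, and the data are the five scores printed by df.head( ). The sketch also prints the population standard deviation for comparison: because the standard deviation squares the distances, it weights large distances more heavily and is never smaller than the mean absolute deviation.

```python
import numpy as np

def mean_absolute_deviation(values):
    """Return the mean of the absolute distances from the mean."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()))

scores = [21, 47, 27, 75, 30]          # the five scores shown by df.head()
mad = mean_absolute_deviation(scores)
std = np.std(scores)                   # population standard deviation (ddof=0)

print(mad, std)
```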

describe( ) method

The pandas describe( ) method calculates summary statistics for a dataframe or a single column. By default it summarizes only the numerical columns of a dataframe.

We can make use of it to get some of the measurements that have been mentioned above.

In [15]:
df['Scores'].describe()
Out[15]:
count    25.000000
mean     51.480000
std      25.286887
min      17.000000
25%      30.000000
50%      47.000000
75%      75.000000
max      95.000000
Name: Scores, dtype: float64
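
The result of describe( ) is itself a Series, so individual statistics can be read by label. As a small sketch on the five scores printed by df.head( ), the IQR can be derived from the '25%' and '75%' entries; the variable names are only for illustration.

```python
import pandas as pd

scores = pd.Series([21, 47, 27, 75, 30], name='Scores')  # scores shown by df.head()

summary = scores.describe()

# read individual statistics by their labels
iqr_from_summary = summary['75%'] - summary['25%']
value_range = summary['max'] - summary['min']

print(iqr_from_summary)
print(value_range)
```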

Posted by Purity on 09/02/2022