Measures of spread describe how spread out the data points are. Examples include quantiles, variance, standard deviation and mean absolute deviation.
In this exercise we are going to compute these measures of spread using Python.
We will use a dataset from Kaggle; follow https://www.kaggle.com/datasets/himanshunakrani/student-study-hours to access the data.
Quantiles are values that split sorted data or a probability distribution into equal parts. There are several different types of quantiles; here are some examples:
- Quartiles - Divide the data into 4 equal parts.
- Quintiles - Divide the data into 5 equal parts.
- Deciles - Divide the data into 10 equal parts.
- Percentiles - Divide the data into 100 equal parts.
Let us import the libraries we will use
import numpy as np
import pandas as pd
We will now load the data that we'll use.
df = pd.read_csv('score.csv')
print(df.head())
   Hours  Scores
0    2.5      21
1    5.1      47
2    3.2      27
3    8.5      75
4    3.5      30
Let's calculate the quartiles for the scores. These are the 5 values (the minimum, the three quartiles and the maximum) that divide the sorted scores into 4 equal parts.
print(np.quantile(df['Scores'], [0, 0.25, 0.5, 0.75, 1]))
[17. 30. 47. 75. 95.]
Quantiles using linspace( )
It can become tedious to list every cut point by hand, especially for finer quantiles such as deciles and percentiles. In such cases we can use np.linspace( ), which returns evenly spaced values between a start and a stop.
Let's get the quartiles of the scores
print(np.quantile(df['Scores'], np.linspace(0, 1, 5)))
[17. 30. 47. 75. 95.]
Let's get the quintiles
print(np.quantile(df['Scores'], np.linspace(0, 1, 6)))
[17. 26.6 38.6 60.8 77. 95. ]
Let's get the deciles
print(np.quantile(df['Scores'], np.linspace(0, 1, 11)))
[17. 22.2 26.6 30. 38.6 47. 60.8 68.6 77. 85.6 95. ]
Interquartile Range (IQR)
This is the difference between the 3rd and the 1st quartile. The IQR tells the spread of the middle half of the data.
Let's get the IQR for the scores
IQR = np.quantile(df['Scores'], 0.75) - np.quantile(df['Scores'], 0.25)
print(IQR)
Another way we can get IQR is by using iqr( ) from the scipy library
from scipy.stats import iqr
IQR = iqr(df['Scores'])
print(IQR)
Outliers are data points that differ markedly from the rest of the data.
A data point is an outlier if either of the following holds:
data < 1st quartile − 1.5 * IQR
data > 3rd quartile + 1.5 * IQR
Let's get the outliers in the scores
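# first get the IQR (stored under a new name so the iqr function is not overwritten)
scores_iqr = iqr(df['Scores'])
# then get the lower & upper thresholds, 1.5 * IQR beyond the 1st and 3rd quartiles
lower_threshold = np.quantile(df['Scores'], 0.25) - 1.5 * scores_iqr
upper_threshold = np.quantile(df['Scores'], 0.75) + 1.5 * scores_iqr
# then find outliers
outliers = df[(df['Scores'] < lower_threshold) | (df['Scores'] > upper_threshold)]
print(outliers)
Empty DataFrame
Columns: [Hours, Scores]
Index: []
With quartiles of 30 and 75 the IQR is 45, so the thresholds are -37.5 and 142.5. Every score lies between 17 and 95, so no row is flagged as an outlier.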
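To see the 1.5 * IQR rule flag a value, here is a minimal sketch on a made-up sample. It assumes the five scores shown by df.head() plus one hypothetical extreme score of 300, which is not part of score.csv; the names sample, q1, q3 and iqr_value are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the five scores from df.head() plus one made-up extreme value
sample = pd.DataFrame({"Scores": [21, 47, 27, 75, 30, 300]})

# Quartiles and IQR of the sample
q1 = np.quantile(sample["Scores"], 0.25)
q3 = np.quantile(sample["Scores"], 0.75)
iqr_value = q3 - q1

# Thresholds sit 1.5 * IQR beyond the 1st and 3rd quartiles
lower_threshold = q1 - 1.5 * iqr_value
upper_threshold = q3 + 1.5 * iqr_value

# Keep only the rows that fall outside the thresholds
outliers = sample[(sample["Scores"] < lower_threshold) | (sample["Scores"] > upper_threshold)]
print(outliers)
```

Only the made-up score of 300 lies outside the thresholds here, so it is the only row returned.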
Variance is the average of the squared distance between each data point and the mean of the data.
Let's calculate the variance of the scores. We will use np.var( )
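A minimal sketch of the sample variance with np.var( ). It assumes only the five scores shown by df.head() stand in for the full score.csv, so the result differs from the variance of the whole dataset; the same call can be applied to df['Scores'].

```python
import numpy as np
import pandas as pd

# Assumption: only the five scores shown by df.head(), not the full score.csv
scores = pd.Series([21, 47, 27, 75, 30], name="Scores")

# ddof=1 divides the sum of squared deviations by n - 1 (sample variance)
sample_var = np.var(scores, ddof=1)
print(sample_var)
```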
With ddof=1 included, the result is the sample variance, which divides by n - 1. If it is excluded, np.var( ) uses its default ddof=0 and we get the population variance, which divides by n.
Let's compute the population variance as well.
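A minimal sketch of the population variance, again assuming the five scores from df.head() rather than the full score.csv. Leaving out ddof makes np.var( ) fall back to its default of 0.

```python
import numpy as np
import pandas as pd

# Assumption: only the five scores shown by df.head(), not the full score.csv
scores = pd.Series([21, 47, 27, 75, 30], name="Scores")

# No ddof argument: np.var defaults to ddof=0 and divides by n (population variance)
population_var = np.var(scores)
print(population_var)
```

For the same data the population variance is always smaller than the sample variance, because it divides the same sum of squares by n instead of n - 1.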
- Standard deviation
This is the square root of the variance.
Let's get the standard deviation of the scores
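A minimal sketch that takes the square root of the sample variance. It assumes the five scores shown by df.head() in place of the full score.csv, so the value differs from the one for the whole dataset.

```python
import numpy as np
import pandas as pd

# Assumption: only the five scores shown by df.head(), not the full score.csv
scores = pd.Series([21, 47, 27, 75, 30], name="Scores")

# Standard deviation as the square root of the sample variance (ddof=1)
sample_std = np.sqrt(np.var(scores, ddof=1))
print(sample_std)
```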
Another way to get the standard deviation is with np.std( ).
Let's use that
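A minimal sketch of np.std( ), assuming the five scores from df.head() rather than the full score.csv. Like np.var( ), it takes a ddof argument that defaults to 0.

```python
import numpy as np
import pandas as pd

# Assumption: only the five scores shown by df.head(), not the full score.csv
scores = pd.Series([21, 47, 27, 75, 30], name="Scores")

# ddof=1 gives the sample standard deviation
sample_std = np.std(scores, ddof=1)
# The default ddof=0 gives the population standard deviation
population_std = np.std(scores)

print(sample_std)
print(population_std)
```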
- Mean Absolute Deviation
This is the average of the absolute distance between each data point and the mean of the data.
Let's find the mean absolute deviation of the scores
# first find the distance between the data points and the mean
dists = df['Scores'] - np.mean(df['Scores'])
# then take the mean of the absolute distances
print(np.mean(np.abs(dists)))
describe( ) method
The pandas describe( ) method calculates summary statistics for a dataframe or a single column. By default it summarizes the numeric columns.
We can make use of it to get some of the measurements that have been mentioned above.
count    25.000000
mean     51.480000
std      25.286887
min      17.000000
25%      30.000000
50%      47.000000
75%      75.000000
max      95.000000
Name: Scores, dtype: float64
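A summary like this comes from calling describe( ) on the column. Here is a minimal sketch, assuming the five rows shown by df.head() in place of the full score.csv, so the numbers differ from the output above; the names sample and summary are illustrative.

```python
import pandas as pd

# Assumption: only the five rows shown by df.head(), not the full score.csv
sample = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
    "Scores": [21, 47, 27, 75, 30],
})

# describe() returns count, mean, std, min, the quartiles and max for the column
summary = sample["Scores"].describe()
print(summary)

# Individual statistics can be read from the result by label
print(summary["std"])
```

The std reported by describe( ) is the sample standard deviation (ddof=1), so it matches np.std( ) only when ddof=1 is passed.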