Statistics for Data Science: Comparing the Distribution of Two Categorical Variables
The Chi-Square Test is used to compare the distributions of two categorical variables measured on the same sample, or the distribution of one categorical variable across different samples. The test is carried out on the absolute number of observations (counts), not on percentages.
The principle of the Chi-Square Test is to compare the observed number of observations in each category with the expected number of observations.
The expected number of observations is obtained from the null hypothesis:
- H0: Observations in variable x are independent of observations in variable y.
OR, if we are comparing the same variable in different samples:
- H0: The proportion of observations in each category is the same in every sample.
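As a minimal sketch of how expected counts follow from the independence hypothesis, the snippet below uses a made-up 2x2 table of counts (the numbers and the name `observed` are illustrative assumptions, not taken from the weather dataset). Under H0, each expected count is the row total times the column total, divided by the grand total.

```python
import numpy as np

# Hypothetical observed counts (illustrative only):
# rows = categories of x, columns = categories of y
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1)   # total per category of x
col_totals = observed.sum(axis=0)   # total per category of y
n = observed.sum()                  # grand total

# Expected counts if x and y were independent
expected = np.outer(row_totals, col_totals) / n
print(expected)
```

The expected table keeps the same row and column totals as the observed one; only the way the counts are spread across the cells changes.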
The Chi-Square formula is very simple to understand and is given by:
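In standard notation this is Pearson's chi-square statistic, where $O_i$ is the observed count in cell $i$ and $E_i$ is the count expected under H0 (the symbol names are a conventional choice):

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```

The statistic is zero when observed and expected counts match exactly, and it grows as they diverge.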
Now, let's try it in practice!
To import and load the data frame:
import numpy as np
import pandas as pd
df = pd.read_csv('/Users/carlamartins/Documents/datasets/weather.csv')
Our dataset has 366 rows and 22 columns, but we will only use two columns: ‘RainToday’ and ‘RainTomorrow’. We can check the value counts for these two variables:
# Value counts for 'RainToday':
df['RainToday'].value_counts().plot(kind='bar', title="Rain today: number of days")
# Value counts for 'RainTomorrow':
df['RainTomorrow'].value_counts().plot(kind='bar', title="Rain tomorrow: number of days")
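Because the test works on counts, the two columns eventually need to be cross-tabulated. One possible way to do this is `pd.crosstab`, sketched here on a small made-up data frame (`toy` and its values are hypothetical; only the column names are borrowed from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the weather data (values are made up)
toy = pd.DataFrame({
    'RainToday':    ['No', 'No', 'Yes', 'No', 'Yes', 'No'],
    'RainTomorrow': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
})

# Contingency table of absolute counts:
# rows = RainToday, columns = RainTomorrow
contingency = pd.crosstab(toy['RainToday'], toy['RainTomorrow'])
print(contingency)
```

With the real data, the same call would presumably take `df['RainToday']` and `df['RainTomorrow']` as its two arguments.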