Statistics for Data Science: Comparing the Distribution of Two Categorical Variables

Chi-Square test

Carla Martins
4 min read · Dec 6, 2022


Photo by Arnold Francisca on Unsplash

The Chi-Square Test is used to compare the distribution of two categorical variables from the same sample or from different samples. The test is carried out using the absolute number of observations, not the percentage of observations.

The principle of the Chi-Square Test is to compare the observed number of observations in each category with the number expected under the null hypothesis.

The expected number of observations is obtained from the null hypothesis:

  • H0: Observations in variable x are independent of observations in variable y.

OR, if we are comparing the same variable in different samples:

  • H0: The proportion of observations in each category is the same in every sample.
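Under the independence hypothesis, the expected count for each cell of a contingency table is commonly computed as row total × column total / grand total. The following is a minimal sketch of that calculation; the counts and the helper name `expected_counts` are made up for illustration and do not come from any dataset.

```python
# Hypothetical 2x2 table of observed counts (made-up numbers).
# Rows: categories of x; columns: categories of y.
observed = [[30, 10],
            [20, 40]]


def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print(row)
```

If x and y were truly independent, the observed table would be close to this expected table; the test measures how far apart the two are.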

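The Chi-Square statistic is simple to understand. For every category (or every cell of the contingency table), take the squared difference between the observed and expected counts, divide it by the expected count, and add up the results:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed count and E_i is the expected count in category i.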
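As a minimal sketch, the statistic can be computed by summing (observed - expected)² / expected over every cell of a table, with expected counts taken from the independence hypothesis. The function name `chi_square` and both tables are hypothetical. The second table has the same percentages as the first but ten times as many observations, which illustrates why the test works on absolute counts rather than percentages.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (obs - exp) ** 2 / exp
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


small = [[30, 10], [20, 40]]      # hypothetical counts, n = 100
large = [[300, 100], [200, 400]]  # same percentages, n = 1000

stat_small, dof = chi_square(small)
stat_large, _ = chi_square(large)
print(f"n = 100:  chi2 = {stat_small:.2f} (dof = {dof})")
print(f"n = 1000: chi2 = {stat_large:.2f} (dof = {dof})")
```

With identical percentages, the larger sample gives a statistic ten times as large, so the same pattern counts as much stronger evidence against H0.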

Now, let's try it in practice!

We will use the weather dataset that can be downloaded from Kaggle.

To import and load the data frame:

import numpy as np
import pandas as pd

df = pd.read_csv('/Users/carlamartins/Documents/datasets/weather.csv')

Our dataset has 366 rows and 22 columns, but we will only use two columns: ‘RainToday’ and ‘RainTomorrow’. We can check the value counts for these two variables:

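import matplotlib.pyplot as plt

# Value counts for 'RainToday' (drawn as its own figure):
df['RainToday'].value_counts().plot(kind='bar', title="Number of days by RainToday")
plt.show()
# Value counts for 'RainTomorrow' (drawn as its own figure):
df['RainTomorrow'].value_counts().plot(kind='bar', title="Number of days by RainTomorrow")
plt.show()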
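The two bar charts show each variable separately, but the test needs the joint counts. One possible next step, sketched here under assumptions, is to cross-tabulate the two columns with `pd.crosstab` and compute the statistic from the resulting table. The small `toy` DataFrame is a made-up stand-in so the snippet runs on its own; with the real data, the loaded `df` would take its place. The p-value line relies on the fact that a 2x2 table has one degree of freedom.

```python
import math

import pandas as pd

# Hypothetical stand-in for the two weather columns (made-up values).
toy = pd.DataFrame({
    "RainToday":    ["No"] * 6 + ["Yes"] * 4,
    "RainTomorrow": ["No", "No", "No", "No", "No", "Yes",
                     "No", "Yes", "Yes", "Yes"],
})

# Contingency table of absolute counts (rows: RainToday, columns: RainTomorrow).
observed = pd.crosstab(toy["RainToday"], toy["RainTomorrow"])

# Expected counts under independence: row total * column total / grand total.
counts = observed.to_numpy()
row_totals = counts.sum(axis=1)
col_totals = counts.sum(axis=0)
n = counts.sum()
expected = row_totals[:, None] * col_totals[None, :] / n

# Pearson chi-square statistic, without continuity correction.
chi2 = ((counts - expected) ** 2 / expected).sum()

# With 1 degree of freedom (2x2 table), the upper-tail probability of the
# chi-square distribution equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

The toy table is far too small for the chi-square approximation to be reliable; it only illustrates the mechanics. In practice, `scipy.stats.chi2_contingency(observed, correction=False)` should return the same statistic along with the p-value, degrees of freedom and expected counts.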


