Statistics for Data Science: Comparing the Distribution of Two Categorical Variables

Chi-Square test

Carla Martins
4 min read · Dec 6, 2022
Photo by Arnold Francisca on Unsplash

The Chi-Square Test is used to compare the distributions of two categorical variables measured on the same sample, or of the same categorical variable across different samples. The test is carried out on the absolute number of observations (counts), not on percentages.

The principle of the Chi-Square Test is to compare the observed number of observations in each category with the expected number of observations.

The expected number of observations is obtained from the null hypothesis:

  • H0: Observations in variable x are independent of observations in variable y.

OR, if we are comparing the same variable in different samples:

  • H0: The distribution of observations across categories is the same in every sample.
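
As a minimal sketch of how the expected counts follow from the independence hypothesis: if x and y are independent, the expected count in each cell of the contingency table is (row total × column total) / grand total. The 2x2 table below is made up purely for illustration and is not taken from the weather dataset.

```python
import numpy as np

# Hypothetical observed counts (rows: categories of x, columns: categories of y)
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()                                # grand total

# Expected count in each cell if x and y were independent
expected = row_totals * col_totals / n
print(expected)
```

The further the observed table is from this expected table, the less plausible the null hypothesis becomes.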

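The Chi-Square formula is very simple to understand and is given by:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed number of observations and E_i is the expected number of observations in category (or table cell) i.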
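
To make the computation concrete, here is a rough sketch that sums (observed − expected)² / expected over all cells of a made-up 2x2 table and cross-checks the result against `scipy.stats.chi2_contingency`. The counts are hypothetical, and `correction=False` is passed only so that SciPy skips Yates' continuity correction and matches the hand calculation.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts, for illustration only
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: row total * column total / grand total
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)
            / observed.sum())

# Chi-Square statistic: sum of (O - E)^2 / E over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the Chi-Square distribution
p_value = chi2.sf(chi2_stat, dof)

# Cross-check with SciPy (no continuity correction, to match the manual result)
stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed, correction=False)

print(f"manual: chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.5f}")
print(f"scipy:  chi2 = {stat_scipy:.3f}, dof = {dof_scipy}, p = {p_scipy:.5f}")
```

A small p-value indicates that the observed counts are unlikely under the null hypothesis.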

Now, let's try it in practice!

We will use the weather dataset that can be downloaded from Kaggle.

To import and load the data frame:

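import numpy as np
import pandas as pd

# Load the weather dataset (adjust the path to where the CSV was saved)
df = pd.read_csv('/Users/carlamartins/Documents/datasets/weather.csv')
df  # in a notebook, this displays the data frame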
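
Once the data is loaded, one possible next step is to cross-tabulate two categorical columns and run the test on the resulting table of counts. The sketch below uses a tiny stand-in data frame rather than the real file; the column names `RainToday` and `RainTomorrow` and all values are assumptions made for illustration and may not match the Kaggle dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for the weather data; column names and values are hypothetical
toy = pd.DataFrame({
    "RainToday":    ["No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
    "RainTomorrow": ["No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# Contingency table of absolute counts (not percentages)
table = pd.crosstab(toy["RainToday"], toy["RainTomorrow"])

# Chi-Square test of independence on the table of counts
stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
```

With so few rows the expected counts are small, so the p-value here is illustrative only; the Chi-Square approximation is more reliable when the expected counts are larger.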
