Point Biserial Correlation with Python

By John DeJesus

Linear regression is a classic technique for describing how two or more continuous features of a data set are related. It is, of course, only appropriate when the features have an approximately linear relationship.

Linear Regression from Towards Data Science article by Lorraine Li.

But what if we need to determine the correlation between dichotomous (aka binary) data and continuous data? A plot of such data will not give us the straight line that we see above. That is where point biserial correlation comes to our aid.

Point Biserial Correlation and how it is computed.

The point biserial correlation coefficient is the same as the Pearson correlation coefficient used in linear regression (measured from -1 to 1). The only difference is that we are comparing dichotomous data to continuous data instead of continuous data to continuous data. From this point on, let’s assume that our dichotomous data is composed of items from two groups (group 0 and group 1) and call the continuous data “y”. The coefficient is computed by the formula below:

Modified Point Biserial Correlation Formula from Wikipedia.

M0 = the mean of the data from group 0.
M1 = the mean of the data from group 1.
Sy = the standard deviation of the continuous data.
n0 = the number of items in group 0.
n1 = the number of items in group 1.
n = the number of items in both groups together (aka the total rows in the data set).
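
Written out with the symbols defined above, the standard form of the coefficient (the version on Wikipedia that uses the population standard deviation) can be stated as follows. This is a reconstruction from the definitions rather than a copy of the original figure. Swapping which group is called group 1 only flips the sign of the result.

```latex
r_{pb} = \frac{M_1 - M_0}{s_y} \sqrt{\frac{n_0 \, n_1}{n^2}},
\qquad
s_y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```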

Applying Point Biserial Correlation

Point biserial correlation (PBC) is frequently applied to the analysis of multiple-choice test questions to determine if they are sufficient questions. Being a sufficient question means that the population answering it (say, a class of students) does not find it too easy (a correlation close to 1) or too confusing (a correlation close to 0, or a negative one).

ZipGrade, an application that scans multiple-choice answer sheets from your phone, provides convenient CSV files of the scanned test data. I will use one from an Algebra quiz I administered to my students. Below is a portion of that data.

Multiple-Choice Quiz data with the scores (Percent Correct) and points earned from the question in each question column.

“Percent Correct” is the score each student earned. Each “Q#” column holds the points a student earned on that question: 2 points if they got it correct, 0 points if they got it wrong. This dataset also illustrates that the dichotomous data does not need to be strictly in the 0 or 1 format you would use to encode data for a machine learning model. Since the formula only uses the size and the mean of each group, the values used to label the groups do not matter, apart from deciding which group counts as group 1 (which sets the sign).
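
A quick way to check the labeling claim is to compute the Pearson coefficient (which the point biserial coefficient equals) for the same answers under different labels. The sketch below uses NumPy's corrcoef and a small made-up set of answers and scores. The arrays and the helper name pearson_r are hypothetical and are not taken from the quiz file.

```python
import numpy as np

# Hypothetical data: 1 = correct, 0 = wrong, plus each student's overall score.
answers_01 = np.array([0, 0, 1, 0, 1, 1, 1, 0, 1, 1])
scores = np.array([40.0, 55.0, 70.0, 50.0, 85.0, 90.0, 75.0, 60.0, 80.0, 95.0])

# The same answers recorded as points earned (0 or 2), as in the quiz export.
answers_02 = 2 * answers_01

# The same answers with the two labels swapped (1 = wrong, 0 = correct).
answers_flipped = 1 - answers_01


def pearson_r(x, y):
    """Pearson correlation of two 1-D arrays via the correlation matrix."""
    return np.corrcoef(x, y)[0, 1]


r_01 = pearson_r(answers_01, scores)
r_02 = pearson_r(answers_02, scores)
r_flipped = pearson_r(answers_flipped, scores)

print(r_01, r_02, r_flipped)
```

With this data, the 0/1 and 0/2 labelings give the same coefficient, while swapping the labels gives the same magnitude with the opposite sign.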

SciPy conveniently has a point biserial correlation function called pointbiserialr. We will apply this function to the first question “Q1”.

import pandas as pd
from scipy.stats import pointbiserialr

# get data
# raw string so the backslash in the Windows path is not treated as an escape
data = pd.read_csv(r'D:\quiz-Alg2U0Quiz-standard20180510.csv')

# get continuous and dichotomous data
grades = data['Percent Correct']
question_1 = data['Q1']

# pbc of first question (dichotomous data first, continuous data second)
pbc = pointbiserialr(question_1, grades)
print(pbc)

This gives us the following result:

PointbiserialrResult(correlation=0.555989931309585, pvalue=1.5230357095184431e-06)

This gives a correlation of about 0.56 with a p-value of about 1.5e-06. Because the p-value is far below the usual 0.05 threshold, the correlation is statistically significant.
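
Since the point biserial coefficient is a special case of Pearson's coefficient, the two SciPy functions should agree on the same inputs. The sketch below checks this on a small made-up sample. The arrays group and y are hypothetical and are not the quiz data.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Hypothetical sample: a two-group label and a continuous measurement.
group = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y = np.array([52.0, 78.0, 66.0, 48.0, 90.0, 61.0, 73.0, 85.0, 57.0, 69.0])

# Both results unpack into (coefficient, p-value).
r_pb, p_pb = pointbiserialr(group, y)
r_p, p_p = pearsonr(group, y)

print(r_pb, p_pb)
print(r_p, p_p)
```

Tuple unpacking is used here because the name of the first field on the result object may differ between SciPy versions.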

For our situation, this correlation suggests that the question was a fair one to give to my class of students. If the correlation were close to 1, that would mean the question was too easy and perhaps did not need to be included. A correlation close to 0, or a negative one, would suggest that the question was poorly made or not fair to give to the students. An extension of this topic could be discussed in a future article.
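
The same call can be repeated for every question column to review a whole quiz at once. The sketch below assumes the question columns all start with "Q" and uses a small made-up data frame in place of the ZipGrade export. The values and the summary table layout are hypothetical.

```python
import pandas as pd
from scipy.stats import pointbiserialr

# Hypothetical stand-in for the quiz export: overall score plus points per question.
data = pd.DataFrame({
    "Percent Correct": [40, 55, 70, 50, 85, 90, 75, 60],
    "Q1": [0, 0, 2, 0, 2, 2, 2, 0],
    "Q2": [2, 0, 2, 2, 0, 2, 2, 2],
    "Q3": [0, 2, 0, 2, 2, 2, 0, 0],
})

# Assumption: every question column name starts with "Q".
question_cols = [col for col in data.columns if col.startswith("Q")]

rows = []
for col in question_cols:
    r, p = pointbiserialr(data[col], data["Percent Correct"])
    rows.append({"question": col, "pbc": r, "pvalue": p})

# One row per question, indexed by the question name.
summary = pd.DataFrame(rows).set_index("question")
print(summary.round(3))
```

A question that every student answers the same way has no variation, so its coefficient is undefined and would need to be handled separately.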

Point Biserial Correlation from Scratch

As in the article I wrote on the hypergeometric distribution, I want to take this opportunity to show how we could implement this formula from scratch with Python.

import numpy as np

def pbc_scratch(binary_data, continuous_data, data):
    """
    Function that computes the point biserial correlation of two pandas data frame columns
    :param binary_data: name of dichotomous data column
    :param continuous_data: name of continuous data column
    :param data: dataframe where above columns come from
    :returns: Point Biserial Correlation
    """
    # sort the two labels so the larger one is always group 1;
    # otherwise the sign would depend on which label appears first
    bd_unique = np.sort(data[binary_data].unique())
    g0 = data[data[binary_data] == bd_unique[0]][continuous_data]
    g1 = data[data[binary_data] == bd_unique[1]][continuous_data]
    # population standard deviation (ddof=0)
    s_y = data[continuous_data].std(ddof=0)
    n = len(data[binary_data])
    n0 = len(g0)
    n1 = len(g1)
    m0 = g0.mean()
    m1 = g1.mean()
    # m1 - m0 gives the same sign as scipy's pointbiserialr
    return (m1 - m0) * np.sqrt((n0 * n1) / n**2) / s_y

This version takes the names of the dichotomous data column, the continuous data column, and the associated data frame as parameters. To recreate our example, we would call the function as pbc_scratch('Q1', 'Percent Correct', data). I do not think this is the best way to write the function, so if you have a better implementation, please share it in the comments below.
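
One possible alternative, offered only as a sketch, works directly on two array-like inputs instead of column names and checks its result against SciPy. The function name point_biserial and the sample arrays are hypothetical. It assumes the group with the larger label is group 1 and uses the population standard deviation.

```python
import numpy as np
from scipy.stats import pointbiserialr


def point_biserial(binary, continuous):
    """Point biserial correlation of a two-valued array and a numeric array.

    The group with the larger label is treated as group 1.
    """
    binary = np.asarray(binary)
    continuous = np.asarray(continuous, dtype=float)

    labels = np.unique(binary)  # unique values, sorted ascending
    if labels.size != 2:
        raise ValueError("binary must contain exactly two distinct values")

    g0 = continuous[binary == labels[0]]
    g1 = continuous[binary == labels[1]]
    n0, n1, n = g0.size, g1.size, continuous.size

    s_y = continuous.std()  # NumPy default is the population form (ddof=0)
    return (g1.mean() - g0.mean()) / s_y * np.sqrt(n0 * n1 / n**2)


# Hypothetical sample: points earned on one question (0 or 2) and overall scores.
question = [0, 0, 2, 0, 2, 2, 2, 0, 2, 2]
scores = [40, 55, 70, 50, 85, 90, 75, 60, 80, 95]

r_scratch = point_biserial(question, scores)
r_scipy, p_scipy = pointbiserialr(question, scores)
print(r_scratch, r_scipy)
```

Taking arrays rather than a data frame and column names keeps the function usable with lists, NumPy arrays, or pandas columns alike.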

Additional Resources

Thanks for reading! I hope you found this tutorial helpful. If you would like a video version to cement your understanding of point biserial correlation, you can check out the video below. It has a more generic example and another multiple-choice question example, with more detail on interpretation.

You can follow or connect with me on LinkedIn and Twitter. Contact me directly at j.dejesus22@gmail.com.

Until next time,

John DeJesus

Data Scientist: Education | Python | Machine Learning | Flask | Mathematics
