Hypergeometric Distribution Explained With Python

By John DeJesus

With probability problems in a math class, the probabilities you need are either given to you or relatively easy to compute in a straightforward manner.

But in reality, this is not the case. You need to compute the probability yourself based on the situation. That is where probability distributions can help.

Today we are going to explore the hypergeometric probability distribution by:

  1. Explaining what situations it is useful for.
  2. Giving the information that you need to apply this distribution.
  3. Coding some of the computations from scratch using Python.
  4. Applying our code to problems.

When do we use the hypergeometric distribution?

The hypergeometric distribution is a discrete probability distribution. It is used when you want to determine the probability of obtaining a certain number of successes when drawing a fixed number of items from a finite population without replacement. This is similar to the binomial distribution, but because drawn items are not put back, the chance of success changes from draw to draw, so you are not given the probability of a single success. Some example situations to apply this distribution are:

  1. The probability of getting 3 spades in a 5 card hand in poker.
  2. The probability of getting 4 to 5 non-land cards in an opening hand in Magic: The Gathering for a standard 60 card deck.
  3. The probability of drawing 60% boys for the freshman class from a mixed-gender group randomly selected in a charter school admissions lottery.
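To make the contrast with the binomial distribution concrete, here is a minimal sketch comparing the two on the spades example. The helper names `hypergeom_prob` and `binom_prob` are made up for this illustration, and the hypergeometric formula it uses is the one explained later in this post.

```python
from math import comb

def hypergeom_prob(N, A, n, x):
    """P(exactly x desired items) when drawing n items WITHOUT replacement."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

def binom_prob(n, p, x):
    """P(exactly x successes) in n independent draws, i.e. WITH replacement."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# 3 spades in 5 cards: a real hand (no replacement) vs. putting each card back
without_replacement = hypergeom_prob(52, 13, 5, 3)
with_replacement = binom_prob(5, 13 / 52, 3)

print(f"without replacement (hypergeometric): {without_replacement:.4f}")
print(f"with replacement (binomial):          {with_replacement:.4f}")
```

The two results come out close but not equal (roughly 0.082 versus 0.088), because every spade you draw leaves fewer spades in the deck for the next draw.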

What parameters (information) do we need for hypergeometric computations?

To compute the probability mass function (that is, the probability of a single outcome) of a hypergeometric distribution, we need:

a) The total number of items we are drawing from (called N).
b) The total number of desired items in N (called A).
c) The number of draws from N we will make (called n).
d) The number of desired items in our draw of n items (called x).

Different tutorials use different letters for these variables. I am using the letters from the video posted below, where I first learned about the hypergeometric distribution.
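Not every combination of these four numbers makes sense. As a rough sketch, the hypothetical helper below (`valid_x_range` is not part of the original code) checks the parameters and reports the smallest and largest x that can actually occur.

```python
def valid_x_range(N, A, n):
    """Return (lowest, highest) possible number of desired items in a draw of n."""
    if not (0 <= A <= N and 0 <= n <= N):
        raise ValueError("need 0 <= A <= N and 0 <= n <= N")
    # If we draw more items than there are undesired ones,
    # some desired items are forced into the draw.
    lowest = max(0, n - (N - A))
    # We cannot get more desired items than we draw, or than exist.
    highest = min(n, A)
    return lowest, highest

print(valid_x_range(52, 13, 5))  # spades in a 5 card hand -> (0, 5)
print(valid_x_range(60, 37, 7))  # non-lands in a 7 card hand -> (0, 7)
print(valid_x_range(10, 8, 5))   # only 2 undesired items -> (3, 5)
```

Any x outside this range has probability 0.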

Coding the Hypergeometric PMF, CDF, and plot functions from scratch

Recall the Probability Mass Function (PMF) is what allows us to compute the probability of a single situation. In our case, that is the specific value for x above. The hypergeometric distribution PMF is below.

$$P(X = x) = \frac{\binom{A}{x}\binom{N-A}{n-x}}{\binom{N}{n}}$$
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

def hypergeom_pmf(N, A, n, x):
    """
    Probability Mass Function for Hypergeometric Distribution
    :param N: population size
    :param A: total number of desired items in N
    :param n: number of draws made from N
    :param x: number of desired items in our draw of n items
    :returns: PMF computed at x
    """
    Achoosex = comb(A, x)          # ways to pick x of the A desired items
    NAchoosenx = comb(N-A, n-x)    # ways to pick the other n-x from the undesired items
    Nchoosen = comb(N, n)          # ways to pick any n items from N
    return Achoosex*NAchoosenx/Nchoosen

We import numpy for the computations in our other functions later, and matplotlib to create our plot function. The comb function from scipy.special computes the 3 combinations in our PMF. We create a variable for each combination and return the PMF computed from them.
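If you want to double-check the PMF without scipy, one option is Python's standard library. The sketch below is an alternative to the function above, not a replacement for it. `hypergeom_pmf_exact` is a name made up here. It uses `math.comb` and `fractions.Fraction` so the result is an exact fraction rather than a float.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf_exact(N, A, n, x):
    """Same PMF as above, but returned as an exact fraction."""
    return Fraction(comb(A, x) * comb(N - A, n - x), comb(N, n))

# 3 spades in a 5 card hand
exact = hypergeom_pmf_exact(52, 13, 5, 3)
print(exact, float(exact))

# Sanity check: the probabilities over every possible x add up to exactly 1
total = sum(hypergeom_pmf_exact(52, 13, 5, x) for x in range(6))
print(total)
```

Exact fractions are slower than floats, but they are handy for confirming that a float result is not hiding a rounding problem.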

The Cumulative Distribution Function (CDF) computes the total probability that x is at most a given value by adding up the PMF over a range of values for x. This will allow us to solve the second example with the Magic: The Gathering game. Another example is determining the probability of obtaining at most 2 spades in a five-card hand (aka 2 or fewer spades).

To answer the spades question with at most 2 spades, we need the CDF below with t = 2:

$$P(X \le t) = \sum_{x=0}^{t} P(X = x)$$

For the Magic: The Gathering game scenario, we can use the same sum of PMF values, but it needs to start at 4 and end at 5:

$$P(4 \le X \le 5) = \sum_{x=4}^{5} P(X = x)$$

Luckily with Python, we can create a function flexible enough to handle both of these problems.

def hypergeom_cdf(N, A, n, t, min_value=None):
    """
    Cumulative Distribution Function for Hypergeometric Distribution
    :param N: population size
    :param A: total number of desired items in N
    :param n: number of draws made from N
    :param t: number of desired items in our draw of n items up to t
    :param min_value: optional lowest number of desired items to include (starts at 0 if omitted)
    :returns: CDF computed up to t
    """
    if min_value is not None:
        return np.sum([hypergeom_pmf(N, A, n, x) for x in range(min_value, t+1)])
    return np.sum([hypergeom_pmf(N, A, n, x) for x in range(t+1)])
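As a cross-check on the from-scratch version, scipy also ships a ready-made hypergeometric distribution in `scipy.stats`. A short sketch follows. Be aware that scipy uses its own letters and argument order, `(k, M, n, N)`, which do not match the letters in this post.

```python
from scipy.stats import hypergeom

# scipy's order: k = desired items drawn, M = population size,
#                n = desired items in the population, N = number of draws
at_most_2_spades = hypergeom.cdf(2, 52, 13, 5)

# A range such as "4 or 5" is the difference of two CDF values:
# P(X <= 5) - P(X <= 3)
four_or_five_nonlands = hypergeom.cdf(5, 60, 37, 7) - hypergeom.cdf(3, 60, 37, 7)

print(at_most_2_spades)
print(four_or_five_nonlands)
```

Both numbers should agree with what the from-scratch CDF returns for the same situations, up to floating point noise.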

Finally, below is the code needed to plot the distribution for your situation. Some of the code and styling was based on this example from this scipy docs page.

def hypergeom_plot(N, A, n):
    """
    Visualization of Hypergeometric Distribution for given parameters
    :param N: population size
    :param A: total number of desired items in N
    :param n: number of draws made from N
    :returns: Plot of Hypergeometric Distribution for given parameters
    """
    x = np.arange(0, n+1)
    # PMF for every possible number of desired items, 0 through n
    y = [hypergeom_pmf(N, A, n, k) for k in range(n+1)]
    plt.plot(x, y, 'bo')
    plt.vlines(x, 0, y, lw=2)
    plt.xlabel('# of desired items in our draw')
    plt.ylabel('Probability')
    plt.title('Hypergeometric Distribution Plot')
    plt.show()

Notice that since we are showing all possibilities from 0 to n, we do not need to create a parameter for x in this function. To see the hypergeometric distribution of the card scenario with a hand of 5, set N = 52, A = 13 and n = 5. You will then obtain the plot below:

Hypergeometric Distribution plot of example 1
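If you cannot display a matplotlib figure, a rough text version of the same picture is easy to produce. This sketch uses only the standard library and prints one bar per possible number of spades. The bar scaling (50 characters for a probability of 1) is an arbitrary choice.

```python
from math import comb

N, A, n = 52, 13, 5  # deck size, spades in the deck, hand size

# PMF for 0, 1, ..., n spades
pmf = [comb(A, x) * comb(N - A, n - x) / comb(N, n) for x in range(n + 1)]

for x, p in enumerate(pmf):
    bar = "#" * round(p * 50)
    print(f"{x} spades: {p:.4f} {bar}")

# The average number of spades per hand, computed from the PMF
mean = sum(x * p for x, p in enumerate(pmf))
print(f"expected spades per hand: {mean:.2f}")
```

The most likely outcome is exactly 1 spade, and the average works out to n * A / N = 1.25 spades per hand.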

Applying our code to problems

Problem 1

Now to make use of our functions. To answer the first question, we use the following parameters in hypergeom_pmf since we want the probability of a single instance:

N = 52 because there are 52 cards in a deck of cards.

A = 13 since there are 13 spades total in a deck.

n = 5 since we are drawing a 5 card opening hand.

x = 3 since we want to draw 3 spades in our opening hand.

Computing hypergeom_pmf(52, 13, 5, 3) gives us a probability of 0.08154261704681873, which is about 8.2%. Not a good chance of getting that hand.
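A quick way to convince yourself of this number is to deal a lot of random hands and count. The simulation below is only a sanity check under simple assumptions: a fair shuffle, a fixed seed, and 100,000 trials, all chosen arbitrarily here. `simulate_three_spades` is a name made up for this sketch.

```python
import random

def simulate_three_spades(trials=100_000, seed=0):
    """Estimate P(exactly 3 spades in a 5 card hand) by dealing random hands."""
    rng = random.Random(seed)
    deck = ["spade"] * 13 + ["other"] * 39
    hits = 0
    for _ in range(trials):
        hand = rng.sample(deck, 5)  # sample() draws without replacement
        if hand.count("spade") == 3:
            hits += 1
    return hits / trials

estimate = simulate_three_spades()
print(f"simulated probability: {estimate:.4f}")
```

The estimate will wobble a little from run to run if you change the seed, but it should land close to the exact value of about 0.0815.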

Problem 2

For the second problem, I will provide some quick background information for you. Magic: The Gathering (Magic for short) is a collectible trading card game where players use creatures and spells to beat their opponents. Players start each game with 7 cards from a 60 card deck.

A deck consists of land cards that allow you to cast your spells, and the spells themselves (non-land cards). Depending on the deck, you usually want 4–5 non-land cards in your opening hand and about 23 land cards in your deck.

Assuming the above, let’s compute the probability of getting 4 or 5 non-land cards. We will use the hypergeom_cdf with the following parameters:

N = 60 since a deck has 60 cards.

A = 37 since we are assuming there are 23 lands in our 60 card deck.

n = 7 since we will be starting with a 7 card hand.

t = 5 since that is the max number of non-lands we want.

min_value = 4 since that is the minimum number of non-land cards we want in our opening hand.

Computing hypergeom_cdf(60, 37, 7, 5, 4) gives us a probability of 0.5884090217751665, which is about 59%. This means we have about a 59% chance of drawing 4 or 5 non-land cards in our opening hand of 7. Not bad. How would you change the number of lands and spells in your deck to increase this probability?
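One way to explore that last question is to recompute the probability for a range of land counts. The sketch below assumes a 60 card deck and a 7 card hand as above. The range of 16 to 30 lands and the helper name `prob_nonlands_between` are arbitrary choices for illustration.

```python
from math import comb

def prob_nonlands_between(lands, low=4, high=5, deck_size=60, hand_size=7):
    """P(low <= non-land cards in the opening hand <= high) for a given land count."""
    nonlands = deck_size - lands
    total_hands = comb(deck_size, hand_size)
    ways = sum(comb(nonlands, x) * comb(lands, hand_size - x)
               for x in range(low, high + 1))
    return ways / total_hands

# Try each land count and keep the probability for each
results = {lands: prob_nonlands_between(lands) for lands in range(16, 31)}
best_lands = max(results, key=results.get)

for lands, p in results.items():
    marker = " <- best in this range" if lands == best_lands else ""
    print(f"{lands} lands: {p:.4f}{marker}")
```

The row for 23 lands reproduces the 0.5884 from above, and the other rows show how the chance shifts as you trade lands for spells.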

Problem 3

Suppose you run a charter school and you admit students through a lottery system. You have 734 applicants, of which 321 are boys and 413 girls. You are only admitting the first 150 students that are randomly selected. Considering your school has more girls than boys, you hope to get more boys this year. Suppose you want to know what your chances are of admitting 90 boys to your school (60% of the 150).

Take a sec to compute your answer then check past the picture below.

You should have used hypergeom_pmf since this is a single instance probability. Using hypergeom_pmf(734, 321, 150, 90) you should have obtained 3.1730164380350626e-06, which is about 0.0003%, far less than one percent… Let’s see a visual of the distribution to see where we have a better chance.

Hypergeometric distribution from Problem 3

Seems we would have a better chance if we try to get about 65 boys, but the probabilities are still fairly low. Maybe you should change your admission process…
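To read those values off without the plot, you can compute the whole distribution and ask for its peak. This is a sketch using only the standard library. It also adds up the tail for "90 or more boys", which is a slightly different question from "exactly 90".

```python
from math import comb

N, A, n = 734, 321, 150  # applicants, boys among them, seats in the lottery

total = comb(N, n)
# PMF for every possible number of boys admitted, 0 through 150
pmf = [comb(A, x) * comb(N - A, n - x) / total for x in range(n + 1)]

most_likely = max(range(n + 1), key=lambda x: pmf[x])
p_exactly_90 = pmf[90]
p_at_least_90 = sum(pmf[90:])

print(f"most likely number of boys: {most_likely} (p = {pmf[most_likely]:.4f})")
print(f"P(exactly 90 boys)  = {p_exactly_90:.3e}")
print(f"P(at least 90 boys) = {p_at_least_90:.3e}")
```

Even the most likely single count has a fairly small probability, because the total is spread over many nearby counts. Ranges such as "at least" or "between" are usually the more useful questions to ask.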

Harbin Mahjong Opening Hand Problems

Now to get you to practice with a game you may not be familiar with. Mahjong is a famous card game played all over China. The rules vary by region, but the general win condition is the same. It is often played for money, but that is not how I play. I am also not encouraging you to gamble…

The version of Mahjong we will be discussing is from a northeast area of China known as Harbin. It is home to the famous Harbin Ice Festival and my in-laws.

The game has 3 suits, numbered one to nine, with 4 copies of each card. This so far totals 108 cards. In addition, there are 4 copies of the Zhong card (the red block standing up in the image above), bringing our total to 112 cards. In Harbin mahjong, you win based on fulfilling particular conditions in addition to the winning hand. By getting at least 3 Zhong cards in your opening hand, you cover almost all of those conditions.

The question is what is the probability of obtaining at least 3 Zhong cards in your opening hand if you are not the dealer? This means your opening hand will have 13 cards. Try to compute this probability on your own using the correct function provided. Then post your answer in the comments below and I’ll let you know if you got it correct.
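As a hint for "at least" questions in general, here is a sketch of the complement trick on a different example, so the Mahjong answer stays yours to find. The helper names `pmf` and `prob_at_least` are made up for this illustration.

```python
from math import comb

def pmf(N, A, n, x):
    """Hypergeometric PMF, same formula as earlier in this post."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

def prob_at_least(N, A, n, k):
    """P(X >= k), computed as 1 - P(X <= k - 1)."""
    return 1 - sum(pmf(N, A, n, x) for x in range(k))

# Example: at least 1 spade in a 5 card hand
p_at_least_one_spade = prob_at_least(52, 13, 5, 1)

# Same answer by adding the PMF for 1, 2, 3, 4 and 5 spades directly
p_direct = sum(pmf(52, 13, 5, x) for x in range(1, 6))

print(p_at_least_one_spade, p_direct)
```

With the functions from this post, the same idea is either 1 - hypergeom_cdf(N, A, n, k - 1) or hypergeom_cdf(N, A, n, t, min_value=k) with t set to the largest possible count.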

Additional Resources

Thanks for reading! You can find the code for this tutorial here at my Github.

If you would like a video tutorial of the information presented here, you can check out the video below. This is the one I used to learn about the hypergeometric distribution.

You can follow or connect with me on Linkedin and Twitter. Contact me directly at j.dejesus22@gmail.com.

Until next time,

John DeJesus

Data Scientist: Education | Python | Machine Learning | Flask | Mathematics
