How Teachers Can Quickly Group Their Students with Machine Learning

By John DeJesus

Suppose you are a teacher who just gave an assessment to your students. You are looking over the results on the students’ papers or in an Excel file, trying to figure out how to group your students for a review lesson that improves their understanding of two concepts.

Image from nytimes.com

While you are sifting through your papers or file, your whole staff is called to a full faculty meeting… You then discover that no laptops or papers are allowed, and that the meeting was about something they could have just emailed you…

Image from conferencesthatwork.com

If you are a teacher, you share the pain of not having enough time to give your all to your students, or of having that time stolen from you. More and more gets added to your plate of multi-disciplinary skills that were never listed in the job description. Luckily, there are companies and tools such as ClassDojo, ZipGrade, Kahoot, Plickers, and GradeCam that help teachers save time.

Photo by Ameen Fahmy on Unsplash

ClassDojo, in particular, has a random group generator built into its app, which saves teachers a ton of time when assigning groups. But random grouping is not the only grouping strategy, and it is not a good fit for every review lesson after an assessment. As a teacher, you often want to group students based on the assessment data. I mentioned this in a previous article here.

I also mentioned in that article that I built a solution for this. I am going to share that solution now. But first, let’s set the stage for the data we will be using.


The Assessment Data

We will do this for an essay rubric example. You can do the same with project rubrics, short answer questions, or multiple-choice points data as well. Suppose you gave an essay with the rubric below:

Image from SlideShare.net by Jenny Tuazon

Suppose then you graded all the papers and got the following Excel File or Google Sheet:

Self-made fictional sample data based on above rubric.

***Note: This website is self-made. To avoid any potential risks, do not put any full student names, student IDs, or any confidential data you are worried about leaking out.***


The Web App

Logo from my site’s home page

TeacherBoard is a site I built to save teachers time analyzing their data and to empower them with data tools they didn’t have access to before. It can take any Excel file or Google Sheet link (the link works as long as your school network does not restrict access). For those curious about the technical aspects, I will separate that information into italics. I built this website from scratch using Python, HTML, CSS, and a bit of JavaScript. I used Flask as the web framework and Git for version control, and it is currently hosted on Heroku.

I will only focus on the grouping service for this article. After you sign up and log in, you are of course welcome to check out the other features also. Note that it is a prototype. There are still improvements that can be made.


Data Upload

After you log in, you should land here:

Logged in to the homepage. “Analyze Student Text Data” is not available.

You will then hover over and click the “Create Categories” option. You will be prompted to load your data through a Google Sheet link or by uploading an Excel file. The data is then stored in a database after a small ETL process is applied. It gets serialized into a pickle format that sits in the database. When the data is needed for a feature, it gets loaded from the database and deserialized.
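To make the serialize/store/load cycle concrete, here is a minimal sketch using only the standard library. It assumes an SQLite database with a hypothetical `uploads` table and hypothetical helper names (`save_upload`, `load_upload`); the article does not say which database or schema the site actually uses, and a plain dictionary stands in for the cleaned spreadsheet.

```python
import pickle
import sqlite3


# Hypothetical table and function names; the real schema is not described in the article.
def save_upload(conn, email, table):
    """Serialize the uploaded table with pickle and store it as a BLOB."""
    blob = pickle.dumps(table)
    conn.execute(
        "INSERT OR REPLACE INTO uploads (email, data) VALUES (?, ?)",
        (email, blob),
    )
    conn.commit()


def load_upload(conn, email):
    """Fetch the BLOB for this user and deserialize it back into Python data."""
    row = conn.execute(
        "SELECT data FROM uploads WHERE email = ?", (email,)
    ).fetchone()
    return pickle.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
conn.execute("CREATE TABLE uploads (email TEXT PRIMARY KEY, data BLOB)")

# Made-up rows standing in for a cleaned spreadsheet.
sheet = {
    "First Name": ["Student A", "Student B"],
    "Content": [2, 4],
    "Organization": [3, 1],
}
save_upload(conn, "teacher@example.com", sheet)
restored = load_upload(conn, "teacher@example.com")
```

Pickle should only be used to load data the application wrote itself, since unpickling untrusted bytes can execute arbitrary code.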

Upload page to input Google Sheet or upload Excel File.
Excel File from above loaded.

Once you press the “Load Sheet” button, you will be taken to the page below and be ready to create your groups.

Create Categories Page to create your categories for groups.

Creating your Categories and Groupings

You will now see the column names of your data listed on the left, dropdown menus and input boxes on the right, and a description of the service at the bottom. Suppose you want to compare how your students did on Content of the essay vs their Organization of that content. You will first select each option from the dropdowns.

You will next select the number of categories you will create groups from. Let’s say 4 for now since there are potentially four possibilities for how students performed in the Content and Organization sections:

  1. Students scored low in both.
  2. Students scored high in Content but low in Organization.
  3. Students scored high in Organization but low in Content.
  4. Students scored high in both.

Next, you will enter any column names you want included in the results. In our case, we want the students’ names, so we enter the First Name column.

Items selected and data inputted as described above.

Once that is done, press the “Create Categories” button and you will be greeted with the following scatterplot. The scatterplot is created with plotly.

Scatterplot of Content (Column 1) and Organization (Column 2) score data plotted and color-coded by categories created by machine learning algorithm (explained further down).

The scatterplot shows the score data of Content (Column 1) vs the score data of Organization (Column 2). The color coding is the categories made by the machine learning algorithm. So how do we interpret these categories? Let’s go through them color by color:

  1. Blue-Students that scored low in Content and high in Organization
  2. Orange-Students that scored low in both Content and Organization
  3. Green-Students that scored high in Content and low in Organization
  4. Red-Students that scored above average in both Content and Organization
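One way to read the categories without relying on the plot alone is to compare each category's average score with the class average for each column. The sketch below assumes hypothetical scores and category labels (not the sample data from this article) and a made-up helper name, `describe_categories`.

```python
from statistics import mean


def describe_categories(rows, labels):
    """Label each category as high/low per column relative to the class mean.

    rows: list of (content_score, organization_score) tuples
    labels: category number for each row, in the same order
    """
    overall = [mean(col) for col in zip(*rows)]
    summary = {}
    for category in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == category]
        centre = [mean(col) for col in zip(*members)]
        summary[category] = tuple(
            "high" if c >= o else "low" for c, o in zip(centre, overall)
        )
    return summary


# Hypothetical scores and category labels, for illustration only.
rows = [(1, 4), (2, 4), (1, 1), (2, 1), (4, 1), (4, 2), (4, 4), (3, 4)]
labels = [1, 1, 2, 2, 3, 3, 4, 4]
summary = describe_categories(rows, labels)
# summary[1] is ("low", "high"): low in Content, high in Organization
```

The first entry of each tuple describes Content and the second describes Organization, matching the column order passed in.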

Now that we have the visual version of this, it would be great to see these categories separated into tables with the names of the students so we can start making our groups. Well, if you scroll down you get exactly that! Below are the tables for each category.

Blue-Students that scored low in Content and high in Organization
Orange-Students that scored low in both Content and Organization
Green-Students that scored high in Content and low in Organization
Red-Students that scored above average in both Content and Organization

Now you have physically separated categories to pull from to create your groups. Let’s say we want to create pairs where students could learn from each other on the two topics. We can create the groups as follows:

Pair the students from Category 1 with students from Category 3, since each excelled in the skill the other struggled with. They can potentially teach each other these skills.

Group A: Graham and Maria

Group B: Jen and Melissa

Joel and Tito from Category 4 were above average for both skills. So we will pair them up with the students in Category 2, the students that scored low in both. We will include Reid from Category 1 since he can provide Chulane and Tito with extra input on how to organize their essays.

Group C: Alela and Joel

Group D: Chulane and Tito and Reid
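The pairing for Groups C and D can be sketched in a few lines. This is only an illustration of the idea, with a hypothetical helper name (`pair_categories`); the site itself does not build the groups for you, and the choice to place leftover students in the last group is an assumption made to mirror the grouping above.

```python
def pair_categories(first, second, extras=()):
    """Pair students from two categories; spread any leftover students across the pairs."""
    groups = [[a, b] for a, b in zip(first, second)]
    paired = len(groups)
    leftovers = list(first[paired:]) + list(second[paired:]) + list(extras)
    for i, student in enumerate(leftovers):
        if groups:
            # Walk backwards so the last group receives the first leftover.
            groups[-1 - (i % len(groups))].append(student)
        else:
            groups.append([student])
    return groups


low_in_both = ["Alela", "Chulane"]   # Category 2
high_in_both = ["Joel", "Tito"]      # Category 4
groups = pair_categories(low_in_both, high_in_both, extras=["Reid"])
# groups[0] is Group C, groups[1] is Group D
```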

Would you have grouped the students in a different way based on this data? How and why?


Technical Explanation of Feature

Below is an explanation of the algorithm, how I applied the algorithm, and two chunks of the code. If you are curious about it please read on. Otherwise, you can skip the mass of italic text and code chunks.

The algorithm used is an unsupervised machine learning algorithm known as K-Means. Given the number of clusters you wish to create, it assigns every data point to one of those clusters using a distance metric such as the classic Euclidean distance. Muthu Krishnan does a good job of explaining the math behind the algorithm in this post. To explain the K-Means process based on his post:

  1. The algorithm creates as many randomly positioned centroid points as the number of desired clusters (4 in our scenario).
  2. It then uses a distance metric (Euclidean for this scenario) to measure the distance from each data point to each centroid point.
  3. Each data point is then assigned to the cluster of the centroid point it is closest to.
  4. The values of the data points in each cluster are then averaged (hence K-Means).
  5. The centroid points are then relocated to the averages computed in step 4.
  6. Steps 2 to 5 are repeated until the centroid points no longer move or a preset number of iterations is reached.
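The six steps above can be sketched in plain Python. This is an illustrative version only, not what the app runs (the app uses sklearn's KMeans). It assumes the starting centroids are picked from the data points themselves, which is a simplification of step 1, and the function name `k_means` and the sample scores are made up for the example.

```python
import math
import random


def k_means(points, k, max_iter=300, seed=0):
    """Plain-Python K-Means following the six steps above.

    Returns (labels, centroids) with labels numbered from 0.
    """
    rng = random.Random(seed)
    # Step 1: start from k randomly chosen data points (a simplification).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Steps 2-3: assign each point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Steps 4-5: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Step 6: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (Content, Organization) scores: two low scorers, two high scorers.
scores = [(1, 1), (1, 2), (4, 4), (4, 3)]
labels, centroids = k_means(scores, k=2)
```

The two low scorers end up sharing one label and the two high scorers the other; which group gets label 0 depends on the random starting centroids.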

For this app, I used sklearn for the K-Means functionality. I created a class for this entire feature that handles the clustering, the plotting, and the creation of the tables. Below are the code gists for the class and the backend function that produces the web page after the data is loaded.

from sklearn.cluster import KMeans
import json
import plotly
import plotly.graph_objs as go


class KmeansGrouper:

    # may need a default number of clusters (maybe 2?) since duplicate values can trigger a convergence warning
    # ex: only 2 distinct clusters found when we requested 4.
    def __init__(self, clusters):
        self.clusters = clusters
        self.kmeans_obj = KMeans(n_clusters=self.clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)

    def scatter_plot(self, kmeans_data, groups):
        '''
        Create the plot for the kmeans grouping to display in the kmeans dashboard
        :param kmeans_data: data with groupings attached
        :param groups: list of groupings from kmeans
        :return: graphjson with chart to plot
        '''

        # Get length of unique groups
        groups_set_length = len(set(groups))

        # create traces for scatter plot (+1 for categories to start at 1)
        trace = [self._scatterplot_maker(i, kmeans_data, groups) for i in range(1, groups_set_length+1)]

        # create figure
        scatter_plot = dict(data=trace,
                            layout=dict(title='<b>Categories Plot</b>',
                                        xaxis=dict(title='Column 1'),
                                        yaxis=dict(title='Column 2'),
                                        hovermode='closest'
                                        )
                            )

        graphJSON = json.dumps(scatter_plot, cls=plotly.utils.PlotlyJSONEncoder)

        return graphJSON

    def _scatterplot_maker(self, group_label, data_np, groups):

        # create traces
        trace = go.Scatter(
            x=data_np[groups == group_label, 0],
            y=data_np[groups == group_label, 1],
            name=str(group_label),
            mode='markers',
            marker=dict(size=12,
                         line=dict(
                             color='black',
                             width=1
                         ))
        )

        return trace

    def kmeans_categories(self, data):
        '''
        Apply kmeans to data with 2 columns selected by the user
        :param data: dataframe from 2 columns entered by the user
        :return: dataframe with kmeans categories created
        '''
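        # work on a copy so the dataframe passed in is not modified (avoids SettingWithCopyWarning)
        data = data.copy()
        data_np = data.values
        categories = self.kmeans_obj.fit_predict(data_np)
        # shift labels so categories start at 1 instead of 0
        categories = categories + 1
        data['Category'] = categories
        return data, categories, data_np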

    def dataframe_constructors(self, df_with_groupings):
        '''
        Creates list of dataframes split by the categories
        :param df_with_groupings: dataframe with groupings from kmeans
        :return: list of dataframes equal to the number of categories
        '''
        categories = sorted(df_with_groupings.Category.unique().tolist())
        dataframes = [df_with_groupings[df_with_groupings.Category == category] for category in categories]
        return dataframes

    def dfs_to_html(self, dataframe_list):
        '''
        Create html versions of dataframes for display
        :param dataframe_list: list of dataframes from dataframe_constructors function
        :return: list of tables to display on html in tables attribute of render_template
        '''
        # start=1 so the CSS class numbering matches the category numbering (which starts at 1)
        tables = [df.to_html(classes=f'Category {i}') for i, df in enumerate(dataframe_list, start=1)]
        return tables

Code for KmeansGrouper class.
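The comment at the top of the class notes that duplicate values can leave fewer distinct clusters than requested. One possible guard, sketched here under the assumption that capping the cluster count is acceptable for the teacher, is to limit the request to the number of distinct score pairs before building the KMeans object. The helper name `safe_cluster_count` and the scores are hypothetical.

```python
def safe_cluster_count(rows, requested, minimum=1):
    """Cap the requested number of clusters at the number of distinct score pairs."""
    distinct = len({tuple(row) for row in rows})
    return max(minimum, min(requested, distinct))


# Hypothetical scores: eight students but only two distinct (Content, Organization) pairs.
rows = [(2, 3), (2, 3), (4, 1), (4, 1), (2, 3), (4, 1), (2, 3), (4, 1)]
clusters = safe_cluster_count(rows, requested=4)
```

The result could then be passed as the `clusters` argument when constructing `KmeansGrouper`, so the class never asks for more clusters than the data can support.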

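import pandas as pd
from flask import flash, render_template
from flask_login import login_required  # assuming Flask-Login provides the decorator

# bp, User, KmeansInputForm, and KmeansGrouper are defined elsewhere in the app

# kmeans grouper dashboard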
@bp.route('/kmeansdashboard/<email>', methods=['GET', 'POST'])
@login_required
def kmeansdashboard(email):
    # get user data from database
    user = User.query.filter_by(email=email).first_or_404()
    data = user.upload

    # get columns
    columns = [*data.columns]

    # Take in kmeans data input
    form = KmeansInputForm()

    form.column1.choices = [(column, column) for column in columns]
    form.column2.choices = form.column1.choices

    if form.validate_on_submit():

        # initialize kmeans and prep data
        kmeans = KmeansGrouper(clusters=form.kmeans_group_nums.data)
        kmeans_prep_data = data[[form.column1.data, form.column2.data]]

        # Get groups_labels attached to data, and numpy version of data and prep for display
        kmeans_data, groups, data_np = kmeans.kmeans_categories(kmeans_prep_data)

        # get scatter plot for displaying group visualization
        plot = kmeans.scatter_plot(data_np, groups)

        # Attach additional columns
        # condition made since default output on add_columns is '', not None
        if form.additional_columns.data != '':

            # split on commas and strip whitespace so 'a,b' and 'a, b' both work;
            # a single column name simply yields a one-item list
            additional_columns = [column.strip() for column in form.additional_columns.data.split(',')
                                  if column.strip()]
            kmeans_data = pd.concat([data[additional_columns], kmeans_data], axis=1)

        # Create category tables
        dataframe_list = kmeans.dataframe_constructors(kmeans_data)
        dataframe_titles = [f'Category {i}' for i in sorted(kmeans_data.Category.unique())]

        flash('Plot and Tables displayed successfully!')
        return render_template('grouping/kmeansdashboard.html', user=user, email=user.email, form=form, plot=plot,
                               columns=columns, tables=kmeans.dfs_to_html(dataframe_list),
                               dataframe_titles=dataframe_titles)
    return render_template('grouping/kmeansdashboard.html', columns=columns, user=user, form=form)

Code for backend function that produces web page.
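The free-text field for additional columns is the part of this function most likely to receive unexpected input. A small standalone helper can normalize the input and check it against the sheet's columns before it reaches pandas. The function name `parse_additional_columns` is hypothetical and is not part of the site's code.

```python
def parse_additional_columns(raw, available):
    """Split a comma-separated string of column names and check that each one exists."""
    names = [name.strip() for name in raw.split(",")]
    names = [name for name in names if name]  # drop empty entries such as a trailing comma
    unknown = [name for name in names if name not in available]
    if unknown:
        raise ValueError(f"Unknown column(s): {', '.join(unknown)}")
    return names


columns = ["First Name", "Content", "Organization"]
selected = parse_additional_columns("First Name,Content", columns)
nothing = parse_additional_columns("", columns)
```

Raising a `ValueError` for unknown names gives the route a chance to flash a helpful message instead of failing with a `KeyError` inside pandas.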


Wrapping Up

Thank you for reading! I hope the teachers reading this find the feature useful and the data scientists/engineers reading this find the code examples useful. This feature can be used for essay rubric data, project rubric data, short response question data, and multiple-choice points data such as the example below from my own ZipGrade account. Just remember to rename the columns to the concepts you are assessing.

File downloaded from my ZipGrade account from a quiz given last year.

Again, this is a prototype, and unfortunately I do not have the time to dedicate to improving it (that goes for the code refactoring also…). If you have any feedback, please let me know below.

Want to talk more about how you or your school can start using this feature? Reach out to me! You can contact me via LinkedIn, Twitter (DMs open), or directly at j.dejesus22@gmail.com.

Until next time,

John DeJesus


Data Scientist: Education | Python | Machine Learning | Flask | Mathematics
