Hitting the Gym with Neural Networks: Implementing a CNN to Classify Gym Equipment

Lauren Holzbauer
Published in Insight · Jan 14, 2020


Will a network trained with fake data be able to generalize to the real world?

Lauren Holzbauer was an Insight Fellow in Summer 2018.

Today, I don’t think twice about walking into any gym, assessing the equipment, and throwing a really good workout together, but it hasn’t always been that way. The first time I walked into a weight room was six years ago, while I was at UCLA. The place was teeming with frenzied bros and the only piece of equipment available was this strange looking thing in the corner. Later that night I found myself Googling, “gym equipment diagonal feet platform cushion triangle.” It took several weeks of Googling to finally figure out that it was called a hyperextension bench.

In an effort to help women feel more comfortable venturing into the weight room, I built a web application that allowed a beginning weightlifter to walk into any gym, upload a photo of any piece of equipment, and receive a curated list of YouTube videos (made by women) demonstrating how to use it.

In this post, I will walk through how I built the image classifier for this project using two different implementations of convolutional neural networks (CNNs). I will also discuss how I navigated specific challenges with data gathering. Take a closer look at the three benchpress images above. Does anything look fishy to you…? Read on to find out more!

Why use a CNN?

CNNs have been widely considered state-of-the-art tools for computer vision since 2012, when AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

CNNs are especially suited for working with images because:

  • They preserve spatial information by accepting a 2-D array as input. In order to send an image through a vanilla NN, you would have to flatten it into a 1-D array.
  • They keep the number of weights from exploding via partially connected layers (convolutional and pooling layers).
  • They can detect the same feature anywhere in the image. A vanilla NN can only detect features that are in identical positions within the image.
  • They are structured hierarchically, which is similar to how animal brains process images. This structure allows the same network (i.e. a “pre-trained” model) to be recycled and reused for many different tasks. We will walk through how to set up a pre-trained network below!

For a deeper dive into the mechanics of CNNs, check out my previous post, “Convolutional Neural Networks Explained…with American Ninja Warrior.”

The Data: Building out a training set with floating gym equipment

Challenge Accepted: What to do when you don’t have enough data

NNs simply need A LOT of training data in order to learn meaningful patterns. This is one of the reasons why deep CNNs were not trained successfully until the late 2000s; it wasn’t until the explosion of the Internet that researchers had enough images to compile very large training datasets. This implied that I would need many, many photos of gym equipment with which to train my image classifier.

I focused on five classes of equipment:

  1. benchpress
  2. hyperextension bench
  3. plyometric box
  4. leg press
  5. power rack

I set a goal of acquiring roughly 1,000 images per class and began combing through Google Images. Displayed below are a few results from my initial search:

Groupings of three images for each class of gym equipment, representative of initial search results.

I assumed that collecting lots of photos of gym equipment would be pretty straightforward. Wrong! Though the images I gathered had good variation in object orientation, there was no variation in background whatsoever: all of the images had plain, white backgrounds. This was a big problem since building a good image classifier requires feeding it as many varied examples as possible so that it can generalize well. On top of this issue, I only had between 40–50 unique images for each class.

In short, I was faced with two major difficulties regarding data collection: I didn’t have nearly enough images, and the images I did have were not representative of a realistic gym environment. This could lead to a classifier that performs poorly in a real gym. Real gyms have noisy backgrounds with mirrors, windows, people, and other equipment.

What the heck! Let’s train a CNN with fake data!

I was familiar with the strategy of using data augmentation techniques (such as flipping, shearing, zooming, etc.) to artificially inflate the size of a training dataset (check out this relevant Medium post and this one, too!), but I wondered if a CNN would still be able to learn from images that had been even more drastically augmented… to the point of being… totally fake!

The idea was to take the images for each class that had white backgrounds and superimpose them onto random images of gym interiors in order to generate a larger number of simulated images. Though these simulations won’t be perfect, they’ll probably look a lot closer to a real-life photo than what we started off with. Take a look below to see a few simulated images being generated:

Generating simulated images using a variety of gym backgrounds and equipment images.

I decided to use something called Pillow for the ensuing image processing tasks. Pillow is a fork of the Python Imaging Library (PIL), and is along the same lines as OpenCV. You can view the documentation and check out a quick tutorial.

In order to generate the simulated images en masse, I built a pipeline with Pillow that:

  1. converted the white background in the foreground image to a transparent channel,
  2. cropped, resized, and padded both the foreground and background images,
  3. superimposed the foreground onto the background, and
  4. slightly blurred the new image with a Gaussian filter to blend the two layers together.

You can check out the full code in my Github repo for this project or take a look at the code snippet below to get a basic idea of how Pillow works.
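Here's a minimal sketch of what such a pipeline can look like with Pillow; the white-pixel threshold, sizes, paste position, and blur radius are illustrative choices rather than the exact values from the project:

```python
from PIL import Image, ImageFilter

def make_simulated_image(foreground_path, background_path, bg_size=(300, 300)):
    """Superimpose a white-background equipment photo onto a gym interior."""
    fg = Image.open(foreground_path).convert("RGBA")
    bg = Image.open(background_path).convert("RGBA").resize(bg_size)

    # 1. Turn the (near-)white background of the foreground into transparency
    pixels = [
        (r, g, b, 0) if r > 240 and g > 240 and b > 240 else (r, g, b, a)
        for (r, g, b, a) in fg.getdata()
    ]
    fg.putdata(pixels)

    # 2. Crop to the object's bounding box (based on the alpha channel), then resize
    fg = fg.crop(fg.getchannel("A").getbbox()).resize((200, 200))

    # 3. Superimpose the foreground onto the background, using its alpha as the mask
    bg.paste(fg, (50, 50), mask=fg)

    # 4. Slightly blur the composite with a Gaussian filter to blend the two layers
    return bg.filter(ImageFilter.GaussianBlur(radius=1)).convert("RGB")

sim = make_simulated_image("benchpress_white_bg.jpg", "gym_interior.jpg")
sim.save("benchpress_simulated.jpg")
```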

I scraped over 150 images of gym interiors from Google Images and combined a random subset of 25 of them with each gym equipment image. I sent each foreground-background pair through the pipeline, resulting in over 5,400 simulated images!

Additional image augmentation with Keras

I chose to implement the image classifiers for this project with Keras. Keras is an open-source deep learning API written in Python that runs on top of TensorFlow, so it's a little more user-friendly and high-level than TensorFlow itself. Check out the Keras documentation here.

What’s convenient about Keras is that it performs data augmentation in-memory. This means that for each batch of training instances, Keras augments each image with a series of random transformations (e.g. flipping, shearing, and zooming). Remember that a neural network is trained over many epochs (one epoch is the number of batches it takes to pass through all of the training instances once), so each image in the training dataset is sent through the network many times during training. Because the random transformations change from batch to batch, the network rarely, if ever, sees exactly the same version of an image twice. This helps prevent overfitting by giving the network even more varied examples to learn from.

Take a look at a few examples of augmented images generated with Keras:

Examples of augmented images generated using in-memory random transformations with Keras.
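Here's a minimal sketch of how this kind of in-memory augmentation is set up with Keras' ImageDataGenerator; the transformation ranges and directory layout are illustrative assumptions rather than the exact settings I used:

```python
from keras.preprocessing.image import ImageDataGenerator

# Random transformations are applied on the fly to every batch
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel values to [0, 1]
    rotation_range=20,       # random rotations up to 20 degrees
    shear_range=0.2,         # random shearing
    zoom_range=0.2,          # random zooming
    horizontal_flip=True,    # random horizontal flips
)

# Stream batches of augmented images straight from a directory of class folders
train_generator = train_datagen.flow_from_directory(
    "data/train",            # one subfolder per class (e.g. "benchpress", "leg_press")
    target_size=(150, 150),
    color_mode="grayscale",
    batch_size=32,
    class_mode="categorical",
)
```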

Now that we’ve generated our simulated images, we’re ready to answer a very interesting question: can a CNN learn from “fake” data? Let’s try out two very different implementations of CNNs and see which one wins!

Two CNNs battle it out! In this corner… Implementation #1: CNN built from scratch

The first implementation I tried was a CNN built and trained from scratch. This CNN has eight layers: three convolutional layers (each followed by a ReLU activation and max pooling layer), then two fully connected layers. The final fully connected layer (output layer) has a softmax activation. Note that since we have five classes, the output layer has five neurons. The architecture is displayed below. You might notice that this architecture is pretty similar to that of LeNet-5.

Architecture of the CNN built from scratch.

Each network layer in the image above is labeled with its type (at the top) and the size of its output image (at the bottom). The input image was resized to 150x150 pixels and converted to grayscale. Keeping all three RGB channels from the original color image would increase the number of weights the network would have to learn and lengthen training time. Notice how the image gets smaller and deeper as it progresses through the network. The first convolutional layer outputs 32 148x148 pixel feature maps, while the subsequent pooling layer decreases the size of each map to 74x74 pixels.

Take a look at the Keras implementation of our CNN-from-scratch below.
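The snippet below is a sketch that follows the architecture described above; the filter counts in the second and third convolutional blocks, the width of the first dense layer, the dropout rate, and the optimizer are illustrative assumptions rather than the project's exact settings:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense, Dropout

model = Sequential()

# Convolutional block 1: Conv2D(number of feature maps, kernel size) + ReLU + max pooling
model.add(Conv2D(32, (3, 3), input_shape=(150, 150, 1)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Convolutional block 2
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Convolutional block 3
model.add(Conv2D(128, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Fully connected layers: flatten, dense + ReLU + dropout, then the 5-neuron softmax output
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(5))
model.add(Activation('softmax'))

# Compile with a loss function, an optimizer, and an evaluation metric
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
```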

Keras code is very intuitive. Let’s walk through the sample code above:

The convolutional layers

In the code above, you can easily make out the 3 convolutional blocks (each with a ReLU activation and max pooling layer). The 2 parameters passed to Conv2D are the number of feature maps and the size of the kernel.

The fully connected and output layers

The fourth block of code implements the final 2 fully connected layers. In this block, we first “flatten” the output tensor from the last pooling layer. This just means that we convert the tensor from rank 3 to rank 1. A rank 1 tensor is basically a 1-D array, but remember that it’s technically a tensor, not an array. That’s why they call it Tensorflow! Check out this great Medium article that discusses the difference between tensors and multidimensional arrays.

The flattened tensor is then fed into the first fully connected layer (the Dense class in Keras). Then a ReLU activation is applied, followed by something called “dropout.” Just like Lasso and Ridge are regularization techniques that can be applied to linear regression models, dropout can be applied to neural networks to prevent overfitting. Read more about dropout in this excellent, in-depth CNN discussion.

The final dense layer serves as the output layer; we assign it the same number of neurons as the number of classes (in our case, we want to distinguish between 5 classes of gym equipment so we use 5 neurons). The softmax activation converts the raw output value for each neuron into a probability, so we wind up with 1 probability score for each class. The class with the highest probability is the winner.

Choosing your loss function and optimizer

Finally, in the last block of code, we must compile the model that we just built. We pass 3 parameters: loss, optimizer, and metrics. The loss function just refers to the cost function, which is some functional representation of the error (all of the loss functions that Keras supports can be found here). The goal is to optimize the weights such that the loss is minimized. The optimizer is the algorithm that figures out exactly how to tweak the weights at each step during backpropagation. In linear regression, for example, the loss is typically the mean squared error and the weights can be optimized with gradient descent.

Gradient descent usually works well but can run into a few challenges in some situations. For example, it can get stuck in local minima or take a long time to converge if you choose a sub-optimal learning rate. Also, vanilla gradient descent applies a single learning rate to every parameter update, so weights tied to rarely-seen features are updated no differently than weights tied to common ones. This isn’t always ideal (e.g. if you have very imbalanced classes, your network won’t have many opportunities to learn from instances belonging to the rare class).

Many optimizers have been developed over the years which improve upon gradient descent. “Adaptive learning” optimizers hold the gradients from previous steps in memory and compare them to the gradient at the current step. This allows them to figure out when they’re approaching the “bottom of the bowl” at which point they slow down and take smaller steps. A few adaptive learning optimizers that work well in practice are Adadelta, RMSprop, and Adam. However, determining which one to use can be more of an art than a science. If you’d like to read more about all of the different optimizers out there, check out this excellent Medium article and this wonderfully detailed blog post.
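In Keras, swapping optimizers is a one-line change. Here's a quick sketch; the learning rates shown are just typical defaults, not values tuned for this project:

```python
from keras import optimizers

# Plain gradient descent with a single, fixed learning rate
sgd = optimizers.SGD(lr=0.01)

# Adaptive-learning optimizers: the step size for each weight is scaled
# using a running history of past gradients
rmsprop = optimizers.RMSprop(lr=0.001)
adam = optimizers.Adam(lr=0.001)
```

Any of these objects (or simply a string name like optimizer='adam') can then be passed as the optimizer argument to model.compile().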

Choosing your evaluation metric

Lastly, metrics just refers to your choice of evaluation metric. You can use this metric to track progress during training or choose between different models. For example, if you notice that, during training, your evaluation metric for the test set suddenly begins to plummet but the metric for the training set remains high, that’s probably a sign that your model is overfitting. The documentation for Keras’ metric functions can be found here.

Before we reveal the results of our CNN from scratch, let’s move on to our second implementation and try to guess which one performed the best!

In this corner… Implementation #2: “Transfer Learning” with VGG16

How does transfer learning work?

For our second implementation of a gym equipment classifier, we’re going to use VGG16 (or…part of it). The VGG16 architecture was the runner-up in the ILSVRC-2014 challenge. VGG16 was developed by the “Visual Geometry Group” (VGG) at Oxford. It was trained on a subset of the ImageNet dataset, which is the world’s largest and most comprehensive database of hand-annotated images. ImageNet contains over 14 million images that fall into over 20,000 categories. All of the ILSVRC-2014 challengers were trained on a subset of ImageNet that contained 1,000 out of the total 20,000 categories. The 1,000 categories included everything from dogs to pianos to space shuttles (the full list can be found here).

The idea of “transfer learning” is that part of a “pre-trained” network can be recycled and reused in another network. This saves a lot of time because we now only need to train the non-recycled layers of the new network. It’s also helpful if we have limited training data. If the pre-trained network was trained on a vast dataset, then recycling layers from that network can effectively “transfer” the knowledge gained from the large dataset to the new model, which can only “see” the small dataset. Transfer learning is basically like cheap cable tv for a resource-constrained network.

You might be asking: how is it helpful to recycle layers from VGG16, which was trained to classify dogs and pianos, when we want to build an image classifier for gym equipment?

Remember that a CNN learns feature hierarchies. Features from higher levels of the hierarchy are formed by the composition of lower-level features. The feature hierarchy of a warped wall is displayed in the figure below. The lower layers learn low-level features while the upper layers learn high-level features. While high-level features are very specific to each unique class, low-level features, like edges and corners, are shared by many different classes. For example, a piano and a benchpress probably share lots of low-level features. This is why, when using a pre-trained network, we can get away with recycling the lower layers. However, we must re-train the top layers.

Feature hierarchy of a warped wall.

Overview of the VGG16 architecture

VGG16 has 16 weight layers (13 convolutional and 3 fully connected), not counting the pooling layers. Its architecture is considered one of the simplest out of all of the state-of-the-art CNNs, especially compared to architectures like GoogLeNet and ResNet. In the image below, you can see that one of the major differences between VGG16 and LeNet-5 is that VGG16 stacks convolutional layers one on top of the other, instead of following each one with a pooling layer.

The VGG16 architecture. The “frozen” pre-trained layers that we recycle and reuse in the new network are labeled with snowflakes.

The layers that are labeled above with snowflakes have been “frozen.” These are the layers that we’re going to recycle. Notice how everything is frozen except for the fully connected layers. When we build the new network, we will reuse the frozen layers from VGG16 and build 2 fully connected layers on top of that. When we train the new network, the weights for the recycled layers will remain frozen during backpropagation while the weights for the unfrozen layers will be tweaked and optimized.

Implementing a pre-trained model

Let’s take a look at some code to see what the implementation looks like!
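Here is a sketch of what that prediction code looks like, arranged so that the line numbers referenced in the walkthrough below line up (the full version is in the Github repo). The image size, dense-layer width, and dropout rate are illustrative assumptions, and the image is kept in RGB because the stock ImageNet weights expect a three-channel input:

```python
import numpy as np
from keras import applications
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.preprocessing import image
img = image.load_img('new_gym_photo.jpg', target_size=(150, 150))  # resize the new image
x = np.expand_dims(image.img_to_array(img), axis=0)  # image array with a leading batch dimension
base_model = applications.VGG16(include_top=False, weights='imagenet')  # frozen VGG16 layers

# Part 1: send the input through the frozen convolutional layers of VGG16
bottleneck_prediction = base_model.predict(x)

# Part 2: free-standing top model with 2 fully connected layers
model = Sequential()
model.add(Flatten(input_shape=bottleneck_prediction.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(5, activation='softmax'))

# Load the weights previously trained for the top layers
model.load_weights('top_model_weights.h5')

# Final prediction: index of the most probable class
class_id = model.predict_classes(bottleneck_prediction)
```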

In the interest of time and space, the code above assumes that we have already trained the weights for the fully connected layers and that we are ready to make a prediction. If you want to see how the weights were trained, check out the full code in my Github repo or read this excellent post in the Keras blog that talks in depth about how to do this.

In order to make a prediction on a new image, we first resize the image, load it into an array, and convert it to grayscale (so that the size and color scale are consistent with the training images). Next, we use NumPy’s expand_dims function to add an extra, outermost dimension to the image array. Keras models expect a batch of images as input, so even a single image needs this leading batch dimension.

Now we’re ready to send the input image through the network. We implement the network in two parts: first, we send the input through the frozen layers of VGG16, then build the new layers on top of that.

Let’s begin with the first part. Another reason that Keras is so convenient is that you can easily instantiate pre-loaded models, such as VGG16. In line 8, we instantiate the frozen layers of VGG16 by passing 2 parameters to applications.VGG16():

  1. include_top=False, which leaves off the “top” layers (the fully connected layers), and
  2. weights=’imagenet’, which loads the weights that were trained on the ImageNet dataset.

We send the input through these frozen layers in line 11 and store the output in the variable, bottleneck_prediction. This output will be the input for the second part of the network.

In lines 14–18, we implement the second part of the network: a free-standing network with 2 fully connected layers, including the output layer. In line 21, we load the weights that we previously trained for these top layers and saved in a file called “top_model_weights.h5”. Finally, we pass bottleneck_prediction to model.predict_classes() in order to make the final prediction.

If you are leveraging a pre-trained network in your model and you find that your evaluation metric is still subpar, you can proceed by un-freezing even more layers and re-training the weights for all of the un-frozen layers. In other words, you can keep removing snowflakes and re-training more and more of the weights until you are happy with the results.
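In Keras, “removing a snowflake” just means flipping that layer’s trainable flag before re-compiling and re-training. Here's a quick sketch; unfreezing only the last convolutional block and the input size shown are illustrative choices, not the settings from this project:

```python
from keras import applications

base_model = applications.VGG16(include_top=False, weights='imagenet', input_shape=(150, 150, 3))

# Keep everything frozen except the last convolutional block (block5)
for layer in base_model.layers:
    layer.trainable = layer.name.startswith('block5')

# Next: stack the fully connected top layers on base_model, compile with a small
# learning rate so the newly unfrozen weights are only nudged gently, and re-train.
```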

As we discussed earlier, VGG16 was trained on 1,000 categories of images, including Dalmatians, space shuttles, and pianos. Of those 1,000 categories, only one (“dumbbell”) falls under the broader umbrella of gym equipment, and it doesn’t resemble any of our 5 classes. I was skeptical about whether or not this implementation would work. In the next section, we’ll find out!

And the winner is…

Both implementations of our gym equipment classifier achieved 99% accuracy on the test set. But remember, the test set is made up of simulated images that we withheld for testing. Since our ultimate goal is to deploy the classifier in a real gym, it doesn’t really matter how the model performs on simulated images. It just assures us that the model learned everything it could from the fake data. The real question is whether or not a model trained on simulated images will be able to translate what it learned to the real world.

To evaluate this, I compiled a validation set of over 100 real-life gym images. A few of these images are displayed below. Each image is annotated with a pair of marks indicating whether it was correctly classified (green check) or misclassified (red check) by each model we tried. The mark on the left corresponds to Implementation #1 (CNN-from-scratch) while the mark on the right corresponds to Implementation #2 (pre-trained network).

A selection of real-life validation images.

You can see from the image above that our two implementations didn’t always agree. Interestingly, both implementations were able to correctly classify the leg press and power rack images that involve humans using the equipment, even though none of the simulated images included humans. Let’s take a look at the confusion matrices to see which implementation did a better job overall.

Confusion matrices for Implementation #1 (CNN from scratch, left) and Implementation #2 (pre-trained network, right).

The above image displays two confusion matrices side-by-side: one for each implementation of our gym classifier. The results for the CNN-from-scratch are displayed in the left matrix, while the results for the pre-trained network are displayed on the right. Each row corresponds to the actual, ground truth label of the validation image while each column corresponds to the predicted label. Images that were classified correctly are tallied up along the diagonal. Misclassified images live off the diagonal.
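For reference, a matrix like this can be tallied in a couple of lines with scikit-learn; the labels below are placeholders rather than the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix

classes = ['benchpress', 'hyperextension bench', 'plyometric box', 'leg press', 'power rack']

# y_true: ground-truth labels of the validation images; y_pred: the model's predictions
y_true = ['benchpress', 'leg press', 'power rack', 'leg press']
y_pred = ['benchpress', 'leg press', 'benchpress', 'leg press']

# Rows are actual classes, columns are predicted classes; correct predictions land on the diagonal
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
```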

And the winner is… the pre-trained network! The CNN from scratch achieved overall accuracy of 85% on the real-life validation set while the pre-trained network achieved 92% accuracy. I was surprised at how well the pre-trained model worked. VGG16 was trained on such a vast dataset that it learned a smorgasbord of low and mid-level features. These features are versatile enough that we can use them to effectively classify gym equipment, even though VGG16 was not trained on any images of gym equipment!

Summary

I hope you enjoyed venturing into the gym with CNNs! In this post, we built a gym equipment classifier two ways: with a CNN trained from scratch and with a CNN that leveraged transfer learning, and we discovered how surprisingly effective transfer learning can be. On the real-life validation set, the transfer learning implementation achieved accuracy 7 percentage points higher than the from-scratch implementation.

A CNN learns feature hierarchies. Features from higher levels of the hierarchy are amalgamations of lower-level features. Lower layers of the network learn low-level features while upper layers of the network learn high-level features. Objects that are very dissimilar on a high level can still share many low-level features. This is why we can reuse lower layers of VGG16 in our gym classifier, even though VGG16 was not trained on images of gym equipment.

Most interestingly, we found that both implementations of CNN were able to generalize very well to the real world even though they were trained with fake data. If you ever find yourself bogged down by a small dataset, don’t be afraid to get creative!

So the next time you head to the gym, don’t forget your CNN!

Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.
