Can we identify 3-D images using very little training data?

Mina Zheng · Published in Insight · Oct 25, 2019 · 10 min read


Image credit: https://vbti.nl/artificial-intelligence/

As the title suggests, this is a story about a question that may resonate with many machine learning practitioners trying to build applications in the real world, where clean, annotated data for a specific problem can be sparse: how do we leverage the power of AI when we have very little data? In this blog, I'll illustrate one approach by walking you through my project during my Data Science Fellowship at Insight, followed by a quick discussion of broader applications.

The initial problem came from my client at Insight. They’re a tech company that specializes in developing software that enables remote monitoring of construction sites. They were interested in understanding the feasibility of identifying objects from 3-D images. Given that it is still relatively costly to generate and label such images in large quantities, they were particularly interested in knowing whether such tasks are feasible when training data are limited.

Problem Statement

For this project, my client specified a public dataset (T-LESS) of 30 industry-relevant objects and categorized them into seven distinct categories. One of these categories consists of only two objects (objects 27 and 29); it was excluded from this project because two objects do not allow a 3-way partition into disjoint training, validation, and testing sets. As a result, I worked with 28 objects split into six classes, with 4–6 objects per class.

My client also specified that CAD model files of the T-LESS dataset be used for this project, and that one object per class be reserved for testing (Objects 4, 8, 12, 18, 23, 30).

Challenges and Strategies

The difficulty of this project is two-fold: one part is the 3-D nature of the images; the other, evidently, is that I only had a minuscule dataset to work with.

The latter is particularly restricting, as it violates the prerequisite of many deep learning methods for image classification — a large number of training images. In order to solve this doubly challenging problem in a limited time frame, I opted to work with 2-D projections and snapshots of the 3-D images. This opened up the opportunity to leverage the vast number of existing algorithms developed for 2-D images. Moreover, certain transformations can minimize loss of information by encoding depth of the 3-D object using color on a 2-D image, as shown in the next section.
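As a rough illustration of this kind of transformation (a simplified sketch rather than the exact code I used), one could sample points from a CAD mesh, drop them onto the x-y plane, and map each point's depth to a diverging colormap so that the third dimension survives the flattening. The mesh library (trimesh), sample count, and colormap here are assumptions:

```python
import trimesh                     # assumed mesh loader; any CAD/mesh library would do
import matplotlib.pyplot as plt

def depth_colored_top_view(mesh_path, out_path, n_points=20000):
    """Flatten a 3-D model to a top view, encoding depth (z) as color."""
    mesh = trimesh.load(mesh_path)          # e.g. a T-LESS CAD model file
    points = mesh.sample(n_points)          # sample points on the surface
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    fig, ax = plt.subplots(figsize=(3, 3))
    # 'RdBu_r' maps high z (closer to a top-down viewer) to red, low z to blue
    ax.scatter(x, y, c=z, cmap='RdBu_r', s=1)
    ax.set_aspect('equal')
    ax.axis('off')
    fig.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close(fig)
```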

The task of identifying objects from very few training examples is known as “few-shot learning”, a fast-evolving research area in computer vision. For my project, I decided to try two different strategies to tackle my small data problem — one by augmenting data, the other via meta-learning.

Two Pipelines

The two schemes below provide a high-level overview of the two pipelines and a quick glance of what the 3-D images and their 2-D conversions look like.

An illustration of the two pipelines I developed for this problem. They start with different strategies to transform 3-D into 2-D, followed by distinct machine learning approaches. The first one serves as a baseline for performance.
Three views for Object 12: CAD model as visualized by Microsoft’s Print 3D software (left). The first pipeline takes 8 color-less snapshots while introducing some randomness by setting the field of view parameter to random. Four of the 8 snapshots for this object are shown here (middle). The second pipeline takes one top view per object, but encodes depth information using color — red dots are closer to the viewer, while blue dots are further away (right).

Pipeline 1: Data Augmentation + Bag of Visual Words

The first pipeline is relatively straightforward and can be considered our benchmark for performance. To generate multiple 2-D images from each 3-D object, I took snapshots of the 3-D image from 8 different angles, and further varied the images by setting the field of view of each snapshot to a random integer between 30 and 90 degrees. I should note that I designed this augmentation methodology specifically for generating 2-D snapshots from 3-D models, so it differs from common 2-D image augmentation methods such as those available from ImageDataGenerator in Keras, which can still be applied to the 2-D snapshots I created.
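Below is a minimal sketch of this snapshot-plus-random-FOV idea, assuming trimesh as the mesh loader and pyrender as the offscreen renderer; the tooling, camera distance, tilt, and image size are illustrative choices rather than exactly what I used.

```python
import numpy as np
import trimesh
import pyrender                    # assumed offscreen renderer for this sketch

def render_snapshots(mesh_path, n_views=8, size=256, seed=0):
    """Render n_views snapshots of a CAD model from evenly spaced angles,
    each with a random field of view between 30 and 90 degrees."""
    rng = np.random.default_rng(seed)
    tm = trimesh.load(mesh_path)
    tm.apply_translation(-tm.centroid)                   # center the object at the origin
    cam_pose = np.eye(4)
    cam_pose[2, 3] = 2.5 * tm.extents.max()              # camera on +z, looking down -z
    renderer = pyrender.OffscreenRenderer(size, size)
    images = []
    for i in range(n_views):
        # spin the object about the vertical axis, with a slight fixed tilt
        spin = trimesh.transformations.rotation_matrix(2 * np.pi * i / n_views, [0, 1, 0])
        tilt = trimesh.transformations.rotation_matrix(np.pi / 6, [1, 0, 0])
        scene = pyrender.Scene(ambient_light=0.4 * np.ones(3))
        scene.add(pyrender.Mesh.from_trimesh(tm), pose=tilt @ spin)
        fov = np.radians(rng.integers(30, 91))           # random field of view per snapshot
        scene.add(pyrender.PerspectiveCamera(yfov=fov), pose=cam_pose)
        scene.add(pyrender.DirectionalLight(intensity=3.0), pose=cam_pose)
        color, _ = renderer.render(scene)
        images.append(color)
    renderer.delete()
    return images
```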

I then fed this labeled and augmented set of images into a typical bag of visual words (BOVW) pipeline, where I used SIFT to detect key points and extract features, then K-means to generate visual words. A good summary of BOVW, a classic algorithm in computer vision, can be found on this Wikipedia page.
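For concreteness, here is a minimal sketch of that BOVW feature step, assuming OpenCV's SIFT and scikit-learn's KMeans; the vocabulary size of 100 is an illustrative assumption rather than the value tuned for this project.

```python
import cv2                          # OpenCV >= 4.4 exposes SIFT in the main module
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(image_paths, n_words=100, seed=0):
    """SIFT descriptors -> K-means vocabulary -> per-image histogram of visual words."""
    sift = cv2.SIFT_create()
    per_image_desc = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)      # desc: (n_keypoints, 128) or None
        per_image_desc.append(desc if desc is not None else np.empty((0, 128), np.float32))
    vocab = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    vocab.fit(np.vstack([d for d in per_image_desc if len(d)]))
    hists = np.zeros((len(image_paths), n_words))
    for i, desc in enumerate(per_image_desc):
        if len(desc):
            words = vocab.predict(desc)                  # assign each descriptor to a visual word
            hists[i] = np.bincount(words, minlength=n_words) / len(words)
    return hists
```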

Before feeding features into SVM classifiers (linear kernel), I normalized all input features using min-max, performed PCA to reduce input dimension, then supplemented the visual features with min-max-normalized measures of object sizes (diameters in 3 dimensions) from the CAD models.
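A hedged sketch of how these pieces could fit together with scikit-learn (the number of PCA components and the helper name are assumptions; in the actual cross-validation the scalers and PCA would be fit on the training fold only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def fit_svm(bovw_hists, diameters, labels, n_components=50):
    """Min-max scale visual features, reduce with PCA, append min-max-scaled
    object diameters (x/y/z from the CAD model), then fit a linear-kernel SVM."""
    vis = MinMaxScaler().fit_transform(bovw_hists)
    vis = PCA(n_components=n_components).fit_transform(vis)
    dims = MinMaxScaler().fit_transform(diameters)       # shape (n_samples, 3)
    X = np.hstack([vis, dims])
    return SVC(kernel='linear').fit(X, labels)
```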

Validation and Testing

As previously mentioned, one object from each of the 6 classes was held out for testing. This gave me 176 (= 22 x 8) images for training and validation. Instead of randomly partitioning these images into training and validation sets for cross-validation, I split the images only at the object level; that is, different 2-D snapshots from the same 3-D object can only be on the same side of the training-validation split. This keeps the validation set consistent with the test set: both consist of objects, not just images, unseen by the trained model.

The slight headache was that the number of training objects per class was both small and uneven: after holding out the test objects, as previously described, three classes had only three objects each, two others had four each, and the remaining one had five. To run the best cross-validation I could, I ensured the splits satisfied the following three rules in a 5-fold cross-validation (one way to construct such folds is sketched after the list):

  1. In each fold, I left out only one validation object from each class
  2. Each object appeared at least once in validation
  3. All objects from the same class had equal chances of appearing in a specific validation set
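As referenced above, here is one simple way to construct folds that satisfy these three rules (a sketch; the helper name and object IDs are hypothetical):

```python
import numpy as np

def object_level_folds(objects_by_class, n_folds=5, seed=0):
    """Hold out exactly one object per class in each fold, cycling through each
    class's (shuffled) objects so every object is validated at least once."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls, objs in objects_by_class.items():
        objs = list(objs)
        rng.shuffle(objs)
        for k in range(n_folds):
            folds[k].append(objs[k % len(objs)])   # round-robin within the class
    return folds   # folds[k] holds one validation object per class; all others train

# Example with hypothetical object IDs:
# folds = object_level_folds({'class_1': [1, 2, 3], 'class_2': [5, 6, 7, 9]})
```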

One more thing to note here is that, in order to compare accuracy with the meta-learning approach, which evaluates its performance on 5-way classification, I repeated the classification six times, omitting one class each time, and averaged accuracy rates from the six iterations to derive my benchmark for performance on 5-way classification.

Pipeline 2: Meta-Learning via Prototypical Network

Overview

Meta-learning is broadly defined as learning by exploiting meta-knowledge about learning systems. In a narrower sense, meta-learning is a class of methods that "learn how to learn" by being "exposed to a large number of tasks" in training, then "tested in their ability to learn new tasks (Finn, 2017)." These methods have proven to perform well on few-shot learning problems. I chose to focus on prototypical networks for their conceptual simplicity and proven performance on the few-shot image classification task.

A schematic illustration of the meta-learning approach to this problem using Prototypical Networks.

Similar to a typical deep learning approach to image classification, a prototypical network first learns embeddings of images through a convolutional neural network, then uses a base learner (in this case, nearest neighbor, operationalized as a softmax output layer in the network) to map the embeddings to classes.

What's distinct about this meta-learning approach is that, instead of training the network to perform well on specific types of images (horizontally in the graph above), this method trains the network to learn an embedding space that generalizes well to classes of images it hasn't seen before. Note that I'm NOT merely talking about unseen images. The key point is that the embedding space should generalize to unseen classes of images, so that it can represent a new class well when provided very few training examples from that class.

Constructing an Episode

Towards the goal of generalizability, a critical aspect of this method is to construct the training problem to be faithful to its eventual testing environment. This means tasks (episodes) in meta-training, meta-validation, and meta-testing phases are all constructed in a similar few-shot, multi-way classification fashion. Each task consists of its own training and testing phases, assembled like a typical image classification task, where training and testing data are distinct images from the same classes.

Constructing the tasks used in my model. All tasks are 5-way classification. Meta-training used 15 shots per class. The meta-testing part illustrates a one-shot situation. Note that the authors who developed Prototypical Networks recommended using the same number of shots between meta-training and meta-testing.

To do this, two stages of random sampling need to happen for each task. The first stage samples classes from a distribution of classes, pre-partitioned into non-overlapping training, validation, and testing classes. The second stage samples individual images, non-overlapping between the within-task training phase and testing phase, from classes sampled in the previous stage; these images constitute the data used in each task. Model performance is evaluated on aggregated metrics of individual tasks.
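A minimal sketch of that two-stage sampling for a single episode (the helper name and query-set size are assumptions; my meta-training tasks were 5-way with 15 shots per class):

```python
import numpy as np

def sample_episode(images_by_class, n_way=5, n_shot=15, n_query=6, rng=None):
    """Stage 1: sample n_way classes. Stage 2: from each sampled class, draw
    disjoint support (within-task training) and query (within-task testing) images."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(images_by_class), size=n_way, replace=False)
    support, query = [], []
    for label, cls in enumerate(classes):
        paths = rng.permutation(images_by_class[cls])    # shuffled image paths for this class
        support += [(p, label) for p in paths[:n_shot]]
        query += [(p, label) for p in paths[n_shot:n_shot + n_query]]
    return support, query
```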

CNN, Nearest Neighbor, and Gradient Descent

With an input dimension of 84 x 84, the convolutional neural network (CNN) maps each image to a 1,600-dimensional embedding. My 2-D projections for meta-testing were also rescaled to 84 x 84 in RGB.
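For reference, the standard four-block convolutional backbone used throughout the few-shot literature produces exactly these dimensions; here is a PyTorch sketch (the network I trained may differ in details):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """3x3 convolution -> batch norm -> ReLU -> 2x2 max pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class ConvEmbedding(nn.Module):
    """Four conv blocks: (3, 84, 84) input -> (64, 5, 5) feature map -> 1,600-d embedding."""
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(3, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
        )

    def forward(self, x):                  # x: (batch, 3, 84, 84)
        return self.encoder(x).flatten(1)  # (batch, 1600)
```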

In each task’s training phase, a prototype is generated for every class in the task as the center of all labeled training examples from that class. When a test image is presented, all prototypes are queried to find the test image’s nearest neighbor in the embedding space, using squared Euclidean distance. Unlike the loss function originally proposed by Snell et al., I used log-likelihood loss following Lee et al.’s example.
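In code, the per-episode prototype and loss computation boils down to a few lines (a PyTorch sketch consistent with the description above; tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support_emb, support_labels, query_emb, query_labels, n_way):
    """Class prototypes are the means of support embeddings; queries are scored by
    negative squared Euclidean distance to each prototype and trained with cross-entropy."""
    protos = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_way)
    ])                                                   # (n_way, embed_dim)
    logits = -(torch.cdist(query_emb, protos) ** 2)      # nearest prototype = highest logit
    loss = F.cross_entropy(logits, query_labels)
    acc = (logits.argmax(dim=1) == query_labels).float().mean()
    return loss, acc
```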

I trained the model using gradient descent with eight episodes per mini-batch, iterating for a total of 22 epochs (1,000 episodes per epoch) with learning rates of 0.1 for the first 19 epochs and 0.06 for the remaining three. Nesterov's momentum of 0.9 and a weight decay of 0.0005 were used throughout. The choice of 22 training epochs was one of convenience, due to the time constraints of Google Colab, which I ran the model on.
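In PyTorch, those optimizer settings look roughly like this (assuming a `model` object holding the network's parameters):

```python
import torch

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9,
    weight_decay=5e-4, nesterov=True,
)
# Drop the learning rate from 0.1 to 0.06 after the first 19 of the 22 epochs.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: 1.0 if epoch < 19 else 0.6,
)
```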

In more recent work, an increased feature dimension (e.g. 16,000 rather than the 1,600 used here) has been used in combination with L1 regularization to improve the power of the network without over-fitting.

Model Performance

In the meta-learning pipeline, I followed the standard setup for performance evaluation in few-shot image classification, where K-way N-shot tasks are sampled from meta-testing classes many times to generate an average accuracy score. I chose K to be 5. To compare performance between the two pipelines, I also constructed testing on the data augmentation + BOVW approach to be 5-way, as described earlier.

Here, a random guess for 5-way classification gives us an accuracy of a little less than 20% due to slight class imbalance.

Using as few as one training example per class, the prototypical network achieved nearly the same performance as BOVW using augmented snapshots from all available training data. As training shots increased to three, the prototypical network achieved over 70% accuracy in 3-way classification.

Summary of Project

The daunting task of classifying 3-D objects using very few examples proves to be feasible. Meta-learning methods, such as prototypical networks, can be especially powerful in this application. Given the few-shot constraint of the task, I did not venture into 3-D image processing/feature extraction. To perform meta-learning on classification of 3-D images, a large labeled database of such images is a prerequisite for meta-training.

Further technical details can be found in this repo on my GitHub. My code for prototypical networks is adapted from MetaOptNet.

Semi-Supervised Extension of Prototypical Networks

One potential real-world use case for this classification pipeline is to leverage crowd-sourced labeling to generate support cases (training data) during meta-testing for novel categories of objects. To handle noise in a training pool that’s not perfectly clean, one may want to leverage an extension of the prototypical networks that uses masked soft k-means to selectively incorporate information from unlabeled distractor categories when generating class prototypes, following the example by Ren et al.
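As a rough sketch, the unmasked version of that soft k-means refinement recomputes each prototype as a weighted mean of its labeled support embeddings and softly assigned unlabeled embeddings (PyTorch; tensor names are illustrative, and the masked variant in Ren et al. additionally learns to down-weight likely distractors):

```python
import torch

def refine_prototypes(protos, support_sums, support_counts, unlabeled_emb):
    """One soft k-means step: softly assign unlabeled embeddings to prototypes,
    then recompute each prototype as a weighted mean of labeled + assigned points.
    protos: (n_way, dim); support_sums: (n_way, dim) per-class sums of support
    embeddings; support_counts: (n_way,); unlabeled_emb: (n_unlabeled, dim)."""
    soft = torch.softmax(-(torch.cdist(unlabeled_emb, protos) ** 2), dim=1)
    numer = support_sums + soft.t() @ unlabeled_emb                 # (n_way, dim)
    denom = support_counts.unsqueeze(1) + soft.sum(dim=0, keepdim=True).t()
    return numer / denom
```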

Meta-Learning vs. Transfer Learning

A reader familiar with applications of deep learning may wonder whether and how meta-learning differs from transfer learning. The similarity between the two is unmistakable: both involve "transferring" knowledge, often in the form of a pre-trained neural network, learned from one data system to another. The distinction, however, is paradigmatic: meta-learning systems have the goal of generalizability baked into their whole design, which is much less the case for transfer learning. Meta-training always aims to minimize generalization error on unseen tasks, and meta-learning algorithms often prefer simple, fast base learners (e.g. nearest neighbor, SVM) or quick weight updates, with an emphasis on reducing over-fitting.

Quick Summary on Meta-Learning

  • Powerful in low-data regime, where data are hard to gather and/or hard to label
  • A successful application requires parallelism between low-data and data-rich tasks
  • Borrows knowledge from data-rich tasks to allow fast learning on low-data tasks; similar to transfer learning but has a different paradigm
  • Other use cases: classify rare diseases, AI translation of low-resource languages, etc.
  • Beyond supervised learning: semi-supervised learning, reinforcement learning

Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.
