AI for 3D Generative Design

Tyler Habowski · Published in Insight · Mar 20, 2020 · 13 min read

Making the design process faster and more efficient by generating 3D objects from natural language descriptions. See the live demo here: datanexus.xyz

Standard design process

The same fundamental process is used to design every modern physical object from tables to spacecraft. How can we use AI to transform this tedious technical exercise into a simple and natural interaction between the computer and designer?

Interpolating between different swivel chair designs. Read on for details!

The Motivation

Currently, the basic design process for physical parts has several drawbacks:

  • It is highly manual and requires technical expertise. Every dimension and feature of each part must be precisely defined using complex, domain-specific software tools in order to produce a usable design.
  • A costly rework loop commonly occurs when something goes wrong in validation, testing, or production, forcing redesigns and product recalls and costing employee time and company profit.
  • Designers' and engineers' creativity in exploring and developing the design space is limited by how fast they can iterate and generate new designs.

Any one of these drawbacks can lead to good designs being overlooked, or to mistakes in the original design that cause problems at a later stage of development.

Using semi-supervised learning to encode the prior knowledge in a way that can be interacted with intuitively would allow designers to iterate faster, make fewer mistakes, and more deeply explore the design space.

To make progress towards this lofty goal, my project aims to make the initial step in this process faster and more efficient with ML by generating simple, everyday 3D objects from natural language descriptions. The code for this project can be found here and slides relating to the development of the project are here.

My proposed design process with Machine Learning

Background Research

Several recent papers have investigated ideas similar to this project; however, none of them captured the specific intent I was aiming for, so I took inspiration from these models but went in a new general direction. Specifically, I wanted to be able to generate objects from at least 10 different categories (the papers below capture only 2–3), and I wanted to develop a model architecture with the capacity to extend to unlabeled 3D shape data.

  1. Text2Shape Stanford, 2018 Paper Site
    Approach: This paper used a multimodal learning system to cluster text embeddings and shape embeddings together so that similar descriptions were clustered with similar objects. A Conditional Wasserstein GAN (Generative Adversarial Network) was then used to generate the 3D models from the text embeddings.
    Drawbacks: The GAN would have taken weeks to train on my available hardware, and it covered only 2 categories (tables and chairs), whereas I wanted more diversity. Additionally, the models were highly complex (~20 million total parameters) and were not feasible to run performantly online.
  2. StructureNet Stanford, 2019 Paper Site
    Approach: This model uses a hierarchical graph-based variational auto-encoder (VAE) to reconstruct models based on graph nodes and bounding boxes.
    Drawbacks: The model learns to generate highly realistic shapes but the annotations are very dense and intensive to create, requiring deep, manually annotated part trees for every model. To make my project more extensible, I wanted to investigate ways of using much simpler annotations in my model like short text descriptions.

Data Processing

To produce an encoded knowledge base for this design space I chose to use the PartNet database (a subset of ShapeNet) which has ~30k densely annotated 3D models across 24 categories. From these annotations and heuristics on the models, I made simplified text descriptions. From the 3D models, I created 3D voxel volumes (voxels are like pixels in 3D) to represent the model in a way that could then be fed into a neural network architecture.
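As a rough illustration of that preprocessing step, here is a minimal sketch of how a mesh could be centered, scaled, and rasterized into a fixed-size boolean voxel grid. The use of the trimesh library and the 32³ resolution are my assumptions, not details taken from the original pipeline.

```python
# A rough sketch of the mesh-to-voxel step, assuming the trimesh library and a
# 32**3 grid (both are my assumptions, not details from the original pipeline).
import numpy as np
import trimesh

RESOLUTION = 32  # assumed number of voxels per axis


def mesh_to_voxels(mesh_path: str, resolution: int = RESOLUTION) -> np.ndarray:
    """Load a mesh and rasterize it onto a fixed-size boolean occupancy grid."""
    mesh = trimesh.load(mesh_path, force="mesh")

    # Center the mesh at the origin and scale its longest side to length 1.
    mesh.apply_translation(-mesh.bounds.mean(axis=0))
    mesh.apply_scale(1.0 / max(mesh.extents))

    # Rasterize: pitch is the edge length of one voxel.
    occupancy = mesh.voxelized(pitch=1.0 / resolution).matrix

    # Pad or crop to the fixed target shape so every sample has the same size.
    voxels = np.zeros((resolution,) * 3, dtype=bool)
    dims = np.minimum(occupancy.shape, resolution)
    voxels[: dims[0], : dims[1], : dims[2]] = occupancy[: dims[0], : dims[1], : dims[2]]
    return voxels
```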

Example: PartNet models with different colors showing the distinctly annotated subparts

Generating the text descriptions from the annotations turned out to be a deceptively difficult task. The annotations were effectively a tree of parts, categories, and subparts which can easily lead to unnaturally formulaic descriptions that are specific, but too cryptic to be useful.

The key insight in successfully generating the descriptions was that an object is fundamentally described by what makes it different from other similar objects. Using this idea, I utilized a basic form of TF-IDF on all of the categories and subparts in the part trees to see which specific entries in the tree are relatively rare compared to similar objects in that class.

Building my text descriptions into a 3D model database
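To make the idea concrete, here is a minimal sketch of that rarity scoring, assuming each model's part tree has already been flattened into a set of part labels. The scoring is a plain inverse-document-frequency variant; the actual heuristics used in the project were more involved.

```python
# A minimal sketch of "describe what is rare": score each part label by how
# uncommon it is among models of the same category, then mention the rarest
# ones. Assumes part trees are already flattened into sets of labels.
from collections import Counter
from math import log


def rare_parts(models: dict[str, set[str]], top_k: int = 3) -> dict[str, list[str]]:
    """models maps a model id to its set of part labels (all from one category)."""
    n_models = len(models)
    doc_freq = Counter(label for labels in models.values() for label in labels)

    def idf(label: str) -> float:
        # Labels that appear in fewer models of the category score higher.
        return log(n_models / (1 + doc_freq[label]))

    return {
        model_id: sorted(labels, key=idf, reverse=True)[:top_k]
        for model_id, labels in models.items()
    }


# Example: the footrest ring is unusual among these chairs, so it is worth mentioning.
chairs = {
    "chair_001": {"chair back", "chair leg", "armrest"},
    "chair_002": {"chair back", "chair leg"},
    "chair_003": {"chair back", "chair leg", "footrest ring"},
}
print(rare_parts(chairs)["chair_003"])  # ['footrest ring', ...]
```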

Below are some randomly sampled generated text descriptions:

  • a large easy chair that is very large and regular width. it has four back supports and four legs.
  • a vase. it does not have a lid. it is average size, average height, and empty.
  • a very wide and very short clock that has a screen and a frame.
  • a mug that is average height and empty and average size. it has an average size handle.
  • a usual height table lamp. it has one head, a base, and a cover. it does not have any feet.

From the 24 categories in PartNet, I narrowed the selection down to 11 for my project, based primarily on how many samples of each category were available in the database versus how diverse those samples were. Some categories, like tables and chairs, are well represented: they have a large diversity in shape but also many thousands of samples. Others, like beds, have large shape variety but relatively few examples (a couple hundred), which makes them especially challenging.

Categories the model was trained on:

  • Table (total number of table examples = 8436)
  • Chair (6778)
  • Lamp (2318)
  • Faucet (744)
  • Clock (651)
  • Bottle (498)
  • Vase (485)
  • Laptop (460)
  • Bed (233)
  • Mug (214)
  • Bowl (186)

Ultimately, this preprocessing produced ~6 text descriptions per 3D voxel volume on average (more complex objects received more descriptions, simpler objects fewer).

Model Architecture

With the prior knowledge database defined, it now has to be encoded in a way that can be intuitively interacted with and sampled. To achieve this goal, two separate models were developed and trained.

Model training architecture for the shape autoencoder and the text encoder

First, a convolutional variational autoencoder (VAE) was used on the 3D voxel volumes in order to produce a decoder model that could take in latent space vectors and produce a design. The encoder model was also used for generating the database of latent vectors for the text model.

Second, a text encoder model was used to generate the latent space vectors from the text descriptions. It was trained to directly predict the encoded latent space vectors of known models from their descriptions generated earlier.

These two models were trained separately and then they were combined into one model that goes from the initial text description to a latent space vector, and then through the decoder model to generate a 3D design.

Shape Autoencoder

The shape autoencoder was highly successful at generating and interpolating between many different kinds of objects. Below is a TSNE map of the latent space vectors colorized by category. Most of the clusters are clearly segmented with some overlap between similar designs, such as tall round lamps and bottles. While there are a few out of place samples like some tables in the chair region, inspecting them manually shows that these models are indeed quite odd and in fact closer in shape to the surrounding models than to their parent category.

Tech Specs: The model used five 3D convolutional layers for both the encoder and the decoder and had a latent space vector of 128 dimensions. Dropout and L2 regularization were used to make the model more generalizable. In total, the model had ~3.2 million parameters and took ~30 hours to train on a single Nvidia V100 GPU.
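As a sketch of what such a model might look like in Keras (only the five-layer structure, the 128-dimensional latent space, and the dropout/L2 regularization come from the spec above; the filter counts, kernel sizes, and 64³ input resolution are my guesses):

```python
# A minimal sketch of the shape VAE, assuming 64**3 binary voxel inputs. Filter
# counts, kernel sizes, and strides are illustrative guesses.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

LATENT_DIM = 128
VOXEL_RES = 64  # assumed input resolution
L2 = regularizers.l2(0.001)


class Sampling(layers.Layer):
    """Reparameterization trick: draw z ~ N(mean, exp(log_var))."""

    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps


def build_encoder():
    x = inputs = layers.Input((VOXEL_RES,) * 3 + (1,))
    for filters in (16, 32, 64, 128, 256):  # five 3D convolutional layers
        x = layers.Conv3D(filters, 4, strides=2, padding="same",
                          activation="relu", kernel_regularizer=L2)(x)
        x = layers.Dropout(0.2)(x)
    x = layers.Flatten()(x)
    z_mean = layers.Dense(LATENT_DIM)(x)
    z_log_var = layers.Dense(LATENT_DIM)(x)
    return tf.keras.Model(inputs, [z_mean, z_log_var, Sampling()([z_mean, z_log_var])])


def build_decoder():
    z = layers.Input((LATENT_DIM,))
    x = layers.Dense(2 * 2 * 2 * 256, activation="relu")(z)
    x = layers.Reshape((2, 2, 2, 256))(x)
    for filters in (128, 64, 32, 16):  # four upsampling blocks...
        x = layers.Conv3DTranspose(filters, 4, strides=2, padding="same",
                                   activation="relu", kernel_regularizer=L2)(x)
    # ...plus a final transposed conv back to one occupancy channel (2 -> 64 voxels).
    voxels = layers.Conv3DTranspose(1, 4, strides=2, padding="same",
                                    activation="sigmoid")(x)
    return tf.keras.Model(z, voxels)
```

During training, the standard VAE objective would apply: a voxel-wise binary cross-entropy reconstruction term plus the KL divergence of the encoded distribution from a unit Gaussian.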

TSNE map of the latent space vectors colored according to object category.

The GIFs below show random walks between encoded shape vectors of different designs. They demonstrate that the model has learned how to smoothly interpolate between disparate geometries.

Interpolating between different types of swivel chairs (many more GIF examples are on my github, and they can also be generated in real time on the app here)
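A minimal sketch of how such interpolations could be produced with the encoder/decoder pair sketched above (the function name is illustrative, not the project's actual code):

```python
# Encode two shapes, walk linearly between their latent vectors, and decode
# each intermediate point back into a voxel volume.
import numpy as np


def interpolate_shapes(encoder, decoder, voxels_a, voxels_b, steps=10):
    """Yield thresholded voxel volumes along the line between two latent codes."""
    z_a = encoder.predict(voxels_a[None, ..., None])[0]  # z_mean of shape A
    z_b = encoder.predict(voxels_b[None, ..., None])[0]  # z_mean of shape B
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b                # linear blend in latent space
        yield decoder.predict(z)[0, ..., 0] > 0.5    # back to a binary volume
```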

Text Encoder

The text encoder model was moderately successful at predicting the latent space vectors given an input description. Predicting 128 separate continuous numbers is a difficult task: the model effectively had to reverse-engineer how the 3D encoder worked on top of interpreting the text. This difficulty is compounded by the fact that a large variety of models can match a given text description, especially when the description is short, e.g. ‘a regular chair’ or ‘a wide bed.’

Tech Specs: SpaCy (GloVe) word embeddings were used to encode the text tokens as vectors, with a max sequence length of 50 words. The model then uses three bidirectional LSTM layers and four fully connected dense layers to generate the latent space vectors from the text descriptions. Spatial 1D dropout was used on the word embeddings, along with regular dropout and L2 regularization on all subsequent layers. The total parameter count was ~3.1 million and the model took ~25 hours to fully train.
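A minimal Keras sketch of that architecture (the unit counts are placeholders, and the embedding matrix is assumed to hold the 300-dimensional spaCy/GloVe vectors for the dataset's vocabulary):

```python
# A minimal sketch of the text encoder: frozen GloVe embeddings, spatial dropout,
# three bidirectional LSTMs, and four dense layers down to the 128-d latent.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

MAX_LEN = 50      # max description length in tokens
LATENT_DIM = 128  # must match the shape autoencoder's latent size
L2 = regularizers.l2(0.001)


def build_text_encoder(vocab_size: int, embedding_matrix):
    tokens = layers.Input((MAX_LEN,), dtype="int32")

    # Frozen, GloVe-initialized embeddings (see "Lessons Learned" below).
    x = layers.Embedding(vocab_size, embedding_matrix.shape[1],
                         weights=[embedding_matrix], trainable=False)(tokens)
    x = layers.SpatialDropout1D(0.2)(x)

    # Three bidirectional LSTM layers; the last one collapses to a single vector.
    for return_sequences in (True, True, False):
        x = layers.Bidirectional(layers.LSTM(128, return_sequences=return_sequences,
                                             kernel_regularizer=L2))(x)
        x = layers.Dropout(0.2)(x)

    # Four fully connected layers; the last one predicts the latent vector.
    for units in (256, 256, 256):
        x = layers.Dense(units, activation="relu", kernel_regularizer=L2)(x)
        x = layers.Dropout(0.2)(x)
    latent = layers.Dense(LATENT_DIM, kernel_regularizer=L2)(x)

    return tf.keras.Model(tokens, latent, name="text_encoder")
```

The training target would be the latent vector the shape encoder produces for the paired model; a plain regression loss such as mean squared error fits that setup, though the article does not specify the exact loss used.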

Tech Stack For Training Models

To train the TensorFlow-based models, I used AWS EC2 (Elastic Compute Cloud) spot instances with EBS (Elastic Block Store) drives storing all data. Spot instances allowed me to use p3.2xlarge instance types at ~1/3 of the cost of on-demand, which enabled significantly more training tests to be run. The p3.2xlarge instances accelerated training by ~6x at only ~3x the cost of p2.xlarge instances, so they were both significantly faster and more cost-effective. Additionally, I created a custom AMI (Amazon Machine Image) and launch template to drastically reduce setup time for new instances.

In order to store the ~90 GB of training data and quickly recover from spot instance terminations, I used several dedicated EBS drives with the same data structure so I could simply plug them into new spot instances as required. To download the training data and set up all of the EBS drives I used a persistent m4.large instance instead of the more expensive GPU accelerated instances.

To deploy code and synchronize data from training runs on multiple instances, I used rsync. I also developed a logging class that organized the training artifacts for each run into a consistent folder structure.
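The logging class itself isn't shown in the article; as an illustration, that kind of run logger might look like the following sketch (names and folder layout are hypothetical):

```python
# A hypothetical run logger: each training run gets its own timestamped folder
# with subfolders for checkpoints, metrics, and generated samples, so results
# synced back from different spot instances stay organized.
import json
import time
from pathlib import Path


class RunLogger:
    def __init__(self, root: str = "runs", name: str = "shape_vae"):
        stamp = time.strftime("%Y%m%d-%H%M%S")
        self.run_dir = Path(root) / f"{name}-{stamp}"
        for sub in ("checkpoints", "logs", "samples"):
            (self.run_dir / sub).mkdir(parents=True, exist_ok=True)

    def save_config(self, config: dict) -> None:
        (self.run_dir / "config.json").write_text(json.dumps(config, indent=2))

    def log_metrics(self, epoch: int, metrics: dict) -> None:
        record = json.dumps({"epoch": epoch, **metrics})
        with open(self.run_dir / "logs" / "metrics.jsonl", "a") as f:
            f.write(record + "\n")
```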

Putting It All Together - Final Results

To deploy the final model, I made a Streamlit app where the user can type in a text description and interactively view the generated 3D model. This app utilizes Streamlit For Teams and reads directly from my github repo to smoothly manage and deploy code. The app also allows the user to interact with the encoded design space via an interactive version of the TSNE map shown earlier.
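A minimal sketch of the Streamlit glue code, assuming the trained models were exported as saved Keras models and that the text encoder was packaged with its tokenization so it accepts raw strings (the paths and that packaging choice are my assumptions, not the demo's actual code):

```python
import matplotlib.pyplot as plt
import streamlit as st
import tensorflow as tf


@st.cache_resource  # load the models once per server process
def load_models():
    text_encoder = tf.keras.models.load_model("models/text_encoder")
    shape_decoder = tf.keras.models.load_model("models/shape_decoder")
    return text_encoder, shape_decoder


text_encoder, shape_decoder = load_models()

description = st.text_input("Describe an object", "a wide chair with four legs")
if description:
    latent = text_encoder.predict(tf.constant([description]))  # (1, 128) latent
    voxels = shape_decoder.predict(latent)[0, ..., 0] > 0.5    # binary volume

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.voxels(voxels)  # basic voxel render; the live demo uses a richer 3D viewer
    st.pyplot(fig)
```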

Try it out here: datanexus.xyz

Output models for various simple descriptions showing how the model changes
Input was: ‘a chair that looks like a lamp’.

Results Discussion

The final model is easy to interact with and demonstrates clear signs of understanding the text input with a wide variety of descriptions. In particular, shape descriptors like ‘wide’ or ‘tall’ are well interpreted and have a reasonable effect on the output. Even some odd descriptions that are outside of the intended scope produce suitable results such as ‘a chair that looks like a lamp’ (shown above).

Where the text encoder model seems to have trouble is with very short or one-word descriptions, because the descriptions in the training set were on average ~13 words long. Such short descriptions are so vague that the model has a hard time finding an appropriate average across all possible models that could match the description. The model also occasionally fails to pick up on small details in the description that should lead to a large change in the output, like whether a mug is full or empty. It seems that this information can sometimes be lost as the LSTM goes along the text sequence.

Perhaps the most notable constraint with this approach, however, is that it cannot generate models outside the scope of the training data. There were no chairs with ten legs in the training set, so it isn't able to extrapolate and generate something with ten legs; an entirely different approach would likely be required to achieve that. However, my overall goal in this project was to encode the design space in an intuitive way to enable rapid exploration of that design space, and for that purpose generating entirely novel designs was less critical.

Conclusion

Ultimately, this project successfully demonstrated a novel way of applying ML to accelerate the 3D design process for simple objects. Using unsupervised learning to encode a large database of prior knowledge, and then supervised learning to build a natural language text model that can interact with that encoded database, allows the user to sample from and explore a knowledge base quickly and easily. While the models generated by this project were relatively simple, there are many ways to extend this idea to more complex models, as explained below. This approach takes the first steps towards a future where designers can work together with ML algorithms to focus on the creative aspects of their work and iterate rapidly on designs in order to achieve results on an unprecedented scale.

Further Work

There are many possible directions in which to expand on this work. Here are a few:

  1. Expanding to more classes and unlabeled objects.
    This project was intentionally set up so that the unsupervised learning part (the shape autoencoder) and the supervised learning part (the text encoder) were separated. In general, beyond PartNet, the number of densely annotated models available is far smaller than the number of unlabeled models. Using the shape autoencoder, it should be possible to greatly expand the diversity of the final generated models by training the autoencoder on a much wider variety of unlabeled objects.
  2. Incorporate attention models into the text encoder.
    This would help the model focus on specific nouns and descriptors that are critical to the output design like ‘chair’ vs ‘table’ or ‘small’ vs ‘large’. It would also help the text model to remember small details in the text descriptions that should lead to large changes in the output model like whether a mug is full or empty.
  3. Use a more conversational model.
    For instance, if the starting input was ‘it is a chair’ it would find what properties are generally associated with chairs and ask relevant clarifying questions like ‘does it have armrests?’ or ‘how many legs does it have?’. These relevant properties could be gathered dynamically by looking at the properties of nearby neighbors in the TSNE latent space. The similar descriptions suggestion feature in my demo application is a simple first step towards how this could work.
  4. Incorporate input data other than text.
    It may be that natural language text is not specific and descriptive enough to generate useful 3D models. Once a basic model is made from a description, other inputs like sliders for attributes such as height, width, number of chair legs, etc., could produce very specific models while still being simple to use.
  5. Switch away from voxels entirely.
    Recent progress has been made with systems like point cloud generators or mesh deformation models and it would be interesting to see if those advances could be applied to generative design as well. Perhaps something like a graph-based tree similar to the StructureNet paper described earlier could be adapted to work in a similar way as this model.

Lessons Learned

These are some interesting lessons learned during my project:

  • Centering the voxels resulted in a ~5x speedup in training and a significantly more generalizable model.
  • Storing the voxel volume training data as arrays of booleans and only casting to floats in batches during training allowed tens of thousands of objects to be stored in memory at once, as opposed to only a few thousand. This sped up training by ~20% and made it more feasible to expand the scope of available categories by including more models (see the sketch after this list).
  • Regularization helped significantly; I added dropout (0.2) and L2 regularization (0.001).
  • For this use case, the available word embeddings were not ideal, but they were still better than training from scratch. With GloVe, the closest vector to ‘large’ was ‘small’, the closest to ‘tall’ was ‘short’, etc. This is because the embeddings are learned from the contexts words appear in, and in these cases either word can be used almost interchangeably. As a result, the text encoder struggled to fully understand concepts like ‘small’ and ‘large’. However, using frozen pre-trained embeddings was still much more generalizable and practical than training custom word embeddings, because the procedurally generated descriptions had a very limited vocabulary (~500 words) and the pre-trained embeddings kept similar words reasonably close. When I tried training the embeddings from scratch, or even just fine-tuning them, the text encoder strongly overfit to the small procedural vocabulary and became too brittle.
  • Inspired by the Text2Shape paper described earlier, I attempted to train a DC GAN (paper) to generate 3D shapes from the PartNet database, but this model was not successful. After many hours of training, it was only capable of generating blobs near the general centroids of the objects and lacked any fine details. It's unclear whether more training would have produced a better result or whether significant rework of the GAN architecture was required. I suspect that a conditional GAN that generates models based on the text embedding vectors, instead of sampling directly from the entire random vector space, would have performed better.
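Here is the boolean-storage trick from the list above as a minimal sketch (array sizes and batch size are illustrative placeholders):

```python
# Keep the full voxel dataset in memory as booleans (1 byte per voxel instead of
# 4 for float32) and only cast each batch to float when feeding the model.
import numpy as np
import tensorflow as tf

voxels = np.zeros((1_000, 64, 64, 64, 1), dtype=bool)  # placeholder data

dataset = (
    tf.data.Dataset.from_tensor_slices(voxels)
    .shuffle(1024)
    .batch(32)
    .map(lambda batch: tf.cast(batch, tf.float32))  # cast per batch, on the fly
    .prefetch(tf.data.AUTOTUNE)
)

# For the autoencoder, the input is also the reconstruction target.
dataset = dataset.map(lambda batch: (batch, batch))
```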

Tyler Habowski is an Insight AI Program Fellow in San Francisco. He spent the last five years as a vehicle design engineer at SpaceX where he developed reusable flight parts and systems for the Falcon rocket family.

Are you interested in transitioning to a career in tech? Sign up to learn more about Insight Fellows programs and start your application today.
