Contextual Topic Identification

Identifying meaningful topics for sparse Steam reviews

Steve Shao
Insight

--

Steam, a video game digital distribution service.

Steam is the largest online platform for selling video games. While it has a carefully designed customer review system, reviews can only be rated as “positive” or “negative,” which is not very informative.

Product reviews can be an extremely important tool for customers to make purchasing decisions. However, since we are dealing with sparse data, there are limited reviews for each product, making it difficult to get useful information from them. In this blog, I’ll describe how to solve this with Contextual Topic Identification, leveraging machine learning methods to identify semantically similar groups and surface relevant category tags of the reviews.

Steam review screenshot

Dataset

I used the Steam Review Dataset, published on Kaggle as a .csv file. It covers 380,000 video game reviews from 46 of the best-selling games on Steam. The raw text data needed careful pre-processing before it could be fed into a model, so I used RegEx to filter certain patterns for normalization and SymSpell for typo correction.
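As a sketch of the normalization step, here is a minimal RegEx-based cleaner. The specific patterns are illustrative assumptions, not the exact ones used in the project, and the SymSpell typo-correction pass is omitted for brevity:

```python
import re

def normalize_review(text: str) -> str:
    """Lowercase, strip URLs and markup, cap character repeats, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)        # drop URLs
    text = re.sub(r"[^a-z0-9\s']", " ", text)   # keep alphanumerics and apostrophes
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "soooo" -> "soo" (cap repeats at 2)
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(normalize_review("Soooo goood!!! http://store.steampowered.com BUY it"))
# -> "soo good buy it"
```

Rules like the character-repeat cap matter for review text, where emphatic spellings ("goooood") would otherwise explode the vocabulary.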

Data Pipeline

For the data pipeline, the raw text data was pre-processed using RegEx and other normalization techniques. I used exploratory data analysis (EDA) to learn how the data is distributed and structured, and to see how the preprocessing could be improved. Then, leveraging Python frameworks, I implemented the topic identification models. Note that model building and EDA form an iterative loop: each modeling pass yields new insights about which methods suit our specific data and task. Once we settle on a final model, we can scale it into a production model using Docker, AWS, or a similar cloud service.

Data pipeline (from development to deployment)

Methods

Typically, there are two ways to complete the topic identification task. We can either resort to hierarchical Bayesian models, like latent Dirichlet allocation (LDA), or we can embed our target documents into some vector space and identify their similarity structure in the vector space by clustering methods.
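To make the first route concrete, here is a minimal LDA run using scikit-learn's implementation on a few toy reviews (the reviews and topic count are illustrative assumptions, not the project's actual setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "the server keeps crashing every night",
    "server lag and constant disconnects",
    "great mods and an active mod community",
    "the mods add hundreds of hours of content",
]

# LDA operates on raw word counts (bag of words)
counts = CountVectorizer(stop_words="english").fit_transform(reviews)

# Fit a 2-topic model; each review gets a probability vector over topics
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2); each row sums to 1
```

The per-document topic vector `doc_topics` is exactly the "probabilistic topic assignment vector" reused later in the combined model.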

For this task, specifically with the Steam review data, LDA has its limitations:

  • It struggles with short texts, which give it very little to model
  • Reviews usually don’t coherently discuss a single topic, making it hard for LDA to identify the main topics of the documents
  • The actual meaning of reviews is largely context-based, so word co-occurrence methods like LDA might fail

With these limitations in mind, we need a technique that embeds the full content of each sentence, which can then be clustered with similar topics. Therefore, embedding plus clustering seemed to be the most straightforward approach here. I tried different methods to get vector representations of the reviews, including TF-IDF, BERT sentence embedding, and my final method, contextual topic embedding (LDA + BERT).

TF-IDF

For TF-IDF, the result was not good: the clusters were neither well separated nor balanced. Because TF-IDF is bag-of-words based (disregarding grammar and word order), it loses contextual information and suffers when the data is incoherent and unstructured.
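The TF-IDF baseline can be sketched in a few lines, assuming scikit-learn and a handful of toy reviews (real runs would tune the number of clusters):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "the game is fun but the server keeps crashing",
    "constant server issues, server down again",
    "love the mods, the mods add endless content",
    "great mods and a friendly mod community",
]

# Each review becomes a sparse TF-IDF vector over the corpus vocabulary
vectors = TfidfVectorizer(stop_words="english").fit_transform(reviews)

# Cluster the vectors; k=2 here purely for illustration
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```

Because the vectors only encode which words co-occur, two reviews expressing the same complaint with different vocabulary land far apart, which is the failure mode described above.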

Clustering result on vectors from TF-IDF (2D visualization from UMAP)

BERT sentence embedding

I then tried using sentence embedding models (BERT) to embed reviews into a vector space where the vectors capture the contextual meaning of sentences. As can be seen in the visualization below, the result was OK, but the clusters would still be hard to distinguish without the colors.

Clustering result on vectors from BERT sentence embedding (2D visualization from UMAP)

Contextual Topic Embedding

Bag-of-words information (LDA or TF-IDF) is effective for identifying topics by finding frequent words when texts are coherent within themselves. On the other hand, when texts are incoherent (in terms of word choice or sentence meaning), extra contextual information is needed to comprehensively represent the idea of the texts.

Then why don’t we combine both the bag-of-words and contextual information? By combining LDA, BERT, and clustering, we can keep the semantic information and create Contextual Topic Identification. Below is the result, where clusters are balanced and quite separated.

Clustering result on vectors from contextual topic embedding (2D visualization from UMAP)

Deep Dive into the Model

So how did I achieve this? Let’s do a quick deep dive into the model design. From the text of each original review, we can get the following:

  • The probabilistic topic assignment vector by LDA
  • The sentence embedding vector by BERT

I then concatenated these two vectors (with a weight hyperparameter to balance the relative importance of information from each source). Since the concatenated vector is in a high-dimensional space, where information is sparse and correlated, I used an autoencoder to learn a lower dimensional latent space representation of the concatenated vector. With the assumption that the concatenated vector should be in some manifold shape in the high dimensional space, I got lower dimension representations with more condensed information. I implemented clustering methods on the latent space representations and got contextual topics from the clusters.
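The steps above can be sketched end to end. In this sketch, random vectors stand in for the real LDA and BERT outputs, `gamma` is the weight hyperparameter balancing the two sources, and PCA stands in for the autoencoder (both learn a lower-dimensional representation; the post's actual model uses a learned autoencoder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_reviews = 100

# Stand-ins for the two per-review vectors described above
lda_topics = rng.random((n_reviews, 10))   # probabilistic topic assignments (LDA)
bert_embed = rng.random((n_reviews, 384))  # sentence embeddings (BERT)

# Weighted concatenation: gamma balances the two information sources
gamma = 15.0
combined = np.hstack([lda_topics * gamma, bert_embed])

# Lower-dimensional latent representation (PCA here; an autoencoder in the post)
latent = PCA(n_components=32, random_state=0).fit_transform(combined)

# Cluster the latent vectors; each cluster is a contextual topic
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(latent)
print(labels.shape)  # one topic label per review
```

Scaling the LDA block by `gamma` before concatenating is what lets a single hyperparameter control how much the bag-of-words signal contributes relative to the contextual embedding.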

Contextual Topic Identification model design

Results

Here are some examples of the model’s results when fitted on part of the dataset, with the number of topics fixed at 10.

Cluster 0

For cluster 0, it appears that people are concerned about server issues.

Cluster 0: server issue

Cluster 2

For cluster 2, the main topic is likely mods, which are a part of many video games.

Cluster 2: mod in games

Cluster 3

For cluster 3, people are talking about hacker issues.

Cluster 3: hacker issue

Evaluation

For evaluation, I used both coherence from topic modeling and the silhouette score from clustering. UMass coherence, which ranges over [-14, 14], measures the similarity of a topic’s top words, while C_V coherence, which ranges over [0, 1], is an improved sliding-window-based variant. The silhouette score, which ranges over [-1, 1], measures within-cluster consistency. For all of these metrics, a larger number means the model is doing better.
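As a minimal illustration of the silhouette score, here is a run on two synthetic, well-separated blobs standing in for latent review vectors (the blob parameters are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Two tight Gaussian blobs far apart in an 8-dimensional latent space
points = np.vstack([
    rng.normal(0.0, 0.2, (50, 8)),
    rng.normal(3.0, 0.2, (50, 8)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
score = silhouette_score(points, labels)
print(round(score, 3))  # close to 1 for well-separated, tight clusters
```

The score compares each point's mean distance to its own cluster against its mean distance to the nearest other cluster, so values near 1 indicate compact, well-separated clusters and values near -1 indicate misassigned points.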

I compared the results of topic modeling by four different methods on the Steam review dataset. Overall, Contextual Topic Identification (BERT + LDA + clustering) was the best among all.

Evaluation metrics for 4 methods of topic modeling

Next Steps

There are several potential improvements that I would like to make, given more time.

  • First, I would like to explore data pre-processing further. Although the normalization was carefully designed, there is still more to be done.
  • I could also fine-tune the sentence embedding parameter based on my downstream task evaluation.
  • Another potential improvement is to build customized models for each product (video game in this case) and find more interesting information from them.

Steve Shao built Contextual Topic Identification as a 4-week project during his time as an Insight AI Fellow in 2020.

Are you interested in transitioning to a career in tech? Sign up to learn more about the Insight Fellows programs and start your application today.

AI fellow at Insight Data Science; Master’s in Statistics at Duke University