The AIgent: Using Google’s BERT Language Model to Connect Writers & Representation

Ryan P. Dalton · Published in Insight · 13 min read · Mar 12, 2020


In 2013, Robert Galbraith — an aspiring author — finished his first novel, The Cuckoo’s Calling. It had all the trappings of a great story: a suspicious death, a private investigator haunted by his past, intrigue, pace, and misdirection. There was only one problem: literary agents, the gatekeepers of the publishing industry, kept rejecting the book — often without even looking at it.

Galbraith eventually opted to publish The Cuckoo’s Calling through an acquaintance of sorts. Interestingly, on the day that Galbraith was revealed to be none other than J.K. Rowling — author of the Harry Potter series — sales of The Cuckoo’s Calling increased by over 150,000 percent. In other words, Galbraith had chops — but the publishing industry failed to see it.

As an aspiring writer myself, I find Galbraith’s tale all too familiar. It illustrates a major pain point in the life cycle of stories: literary publishing is a $20 billion industry…with a bottleneck problem. Much like the challenge faced by hiring managers, literary agents can receive thousands of pitches every year, and there is no easy way to triage them. This, in turn, means that writers often send their materials to every agent they can find. The end result is a nasty and increasingly impersonal feedback loop that builds unnecessary costs and delays into publishing.

With breaking this bottleneck in mind, I’ve used my time as an Insight Data Science Fellow to build the AIgent, a web-based neural net to connect writers to representation. Using nothing more than a book’s synopsis, the AIgent can surface similar books, genre tags, and sales proxies. This will allow agents to filter their inboxes to better focus on the pitches that best align with their portfolios. Likewise, it will help writers identify agents who are a good fit. Ultimately this will mean a more streamlined, inclusive, and personal publishing industry.

The AIgent was built with BERT, Google’s state-of-the-art language model. In this article, I will discuss the construction of the AIgent, from data collection to model assembly. I will also cover simple extensions of the AIgent, its cross-media potential, its power as an unbiased, content-based recommender system, and its capacity to increase fairness in content acquisition. Lastly, I will discuss my own experience as an AIgent beta-tester. On the day I finished my last novel, I did not begin emailing agents in search of representation. Instead, I built the AIgent.

Building the AIgent

The AIgent construction pipeline.

Data Collection

The AIgent leverages book synopses and book metadata. The former is a short block of text, generally between 50 and 400 words in length. The latter is any type of external data that has been attached to a book — for example genre tags, ratings, and sales.

To my knowledge, the most extensive repository of synopses and metadata is Goodreads. For my purposes, the most interesting data on Goodreads comes in the form of genre/content tags. A given book may be associated with hundreds of different tags, ranging from broad genre labels, like ‘Mystery’ or ‘Historical’, to specific content tags like ‘Werewolves’ and ‘Vampires’.

To collect these genre tags and other metadata, I took advantage of the well-documented Goodreads API. Unfortunately, that API does not permit collection of synopses. To get around this, an enterprising and motivated individual might use Scrapy and Beautiful Soup to scrape synopses. To build the AIgent, I started with synopses and metadata from 100,000 books.
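For the curious, here is a minimal sketch of what the Beautiful Soup side of that scraping workflow might look like. The URL argument and the 'description' element id are assumptions about page structure rather than a documented interface, and any scraping should of course respect the site's terms of service.

```python
import requests
from bs4 import BeautifulSoup

def fetch_synopsis(book_url: str) -> str:
    """Fetch a book page and return its synopsis text.
    The 'description' id is a guess at the page structure and would need
    to be adapted to whatever markup the site actually uses."""
    resp = requests.get(book_url, headers={"User-Agent": "AIgent-research"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    node = soup.find(id="description")  # hypothetical element id
    return node.get_text(" ", strip=True) if node else ""
```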

Features: DistilBERT Text Embeddings

Once I had a raw dataset, I could begin engineering features and building a natural language processing (NLP) model. Ultimately, I wanted this model to do two things:

  1. Build a rich representation of a text synopsis for external comparisons
  2. Mathematically describe the relationship between a book’s synopsis and its metadata

To get there, I needed to do a bit of pre-processing on my synopses (i.e. features) and metadata (i.e. labels).

The most powerful approach for the first task is to use a ‘language model’ (LM), i.e. a statistical model of natural language. These models use a variety of approaches to transform text into numerical representations, which can then be used for downstream tasks such as classification and semantic parsing.

The last five years have seen an incredible increase in the power of LMs. As in the world of image classification, some of the greatest advances in LMs have come with transfer learning, and with unsupervised neural nets that build ‘general’ models of language. This class of model includes OpenAI’s generative text model GPT-2, which produces eerily human-like text at the sentence to short paragraph level.

More relevant to the AIgent is Google’s BERT model, a task-agnostic (i.e. general-purpose) LM that has thus far been extended to over 100 languages and achieves state-of-the-art results on a long list of language tasks, including sequence classification. If you have not used BERT before, this Colab notebook is a great place to get started. I tested several different flavors of BERT for use as synopsis classifiers before settling on the DistilBERT model from Hugging Face. It’s much faster than the full BERT model without sacrificing much in the way of performance.

DistilBERT tokenization and embedding, from Jay Alammar.

Text synopses are ‘tokenized’ with the aid of a reference library. They are then passed to the DistilBERT neural net to produce ‘embeddings’. These can be thought of as high-dimensional text representations, similar to the representations of images produced by the hidden layers of convolutional neural nets. As I’ll discuss below, these embeddings form the basis of the AIgent’s ability to search for similar titles.

Typically, the exact task DistilBERT performs depends on what is used as its output layer (e.g. a softmax function for classification). Rather than employing a built-in output layer, I collected text embeddings directly from DistilBERT’s final hidden layer and used them to build a ~100,000-member database of synopsis embeddings.

This database can be used as (1) a library against which an unknown synopsis can be queried for similarity and (2) a feature set for the training of classifiers. Using features directly (rather than a built-in output layer) gave me tons of flexibility in how the classifiers could be trained. For a more detailed explanation on using the output from DistilBERT’s last hidden layer for a simple classification task, I recommend this excellent post by Jay Alammar.
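As a rough sketch, the embedding step looks something like the following with the Hugging Face transformers library. Here I take the vector at the [CLS] position of the last hidden layer as the synopsis representation; mean-pooling over token vectors is an equally reasonable alternative.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

def embed_synopsis(text: str) -> torch.Tensor:
    """Return a 768-dimensional embedding for a synopsis, taken from the
    [CLS] position of DistilBERT's last hidden layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)
```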

Labels: ‘Warm-Encoded’ Genre Labels

The use of Goodreads genre tags as labels added both power and complication to my task. The tags are user-generated, which makes them an excellent reflection of the ‘user experience’ of a given book — as opposed to how the book’s author might describe it. That is an important distinction — Yelp would be less popular and less reliable if it only told users what restaurant owners say about themselves.

But these tags are not necessarily a ‘ground truth’ about books, and many tags appear to be spurious or incorrect. One way that we can get around this is to use the proportion of tags that fall into a given class as a measure of our degree of confidence in that class association. In other words, if 0.1% of users tagged Cuckoo’s Calling with ‘science-fiction’, but 10% of users tagged it with ‘mystery’, we should be more confident in the latter label.

After some experimentation, I landed on a strategy I’ll call ‘warm encoding’: if greater than 1% of a book’s tags were in a particular class, I encoded the book as belonging to that class, non-exclusively. It is worth keeping in mind that the choice of threshold will impact downstream metrics like precision, as well as the effective size of the dataset. In a later version of the AIgent, I expect that there may be performance gains associated with adjusting my warm-encoding strategy based on qualities like the shape of tag distributions and levels of user experience.
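For concreteness, the warm-encoding step might look like the sketch below; the genre list and the tag-count dictionary format are illustrative, and the 1% threshold is the one described above.

```python
import numpy as np

GENRES = ["mystery", "science-fiction", "fantasy", "historical-fiction"]  # etc.

def warm_encode(tag_counts: dict, genres=GENRES, threshold=0.01) -> np.ndarray:
    """Non-exclusive 'warm' encoding: a book is labeled with a genre when
    that genre accounts for more than `threshold` of its user tags."""
    total = sum(tag_counts.values())
    if total == 0:
        return np.zeros(len(genres), dtype=int)
    return np.array(
        [int(tag_counts.get(g, 0) / total > threshold) for g in genres],
        dtype=int,
    )

# e.g. warm_encode({"mystery": 900, "science-fiction": 5, "thriller": 95})
# -> array([1, 0, 0, 0])
```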

Multi-label Classification

With my features and labels in hand, I set out to build a classifier. Because a book might have many different genre tags, this is a multi-label classification problem. Many models have been used for multi-label text classification, including Naive Bayes, Random Forest, and Support Vector Machines. For my purposes, I was interested in a model that was interpretable, not likely to overfit, and easily extensible to new genres. Moreover, I had plenty of balanced data.

For these reasons, I settled on logistic regression with Scikit-learn, and used a one-versus-all approach with each of the 20 most common genre labels. Model fine-tuning was performed to maximize precision, because (1) incorrect genre predictions are likely to suppress user engagement and (2) genre labels are semi-overlapping, which de-emphasizes returning every correct label. At inference time, the AIgent uses a synopsis embedding and one model per genre to produce a list of probabilities of class membership. On my test set, this approach resulted in ~75–95% accuracy and F1 scores of ~0.65–0.95 across genres. Future versions of the AIgent may be improved with the use of more nuanced metrics, depending on the class-encoding scheme applied.
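A minimal sketch of that training setup, assuming X is a matrix of synopsis embeddings and Y is the warm-encoded label matrix described above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# One independent logistic regression per genre; predict_proba gives a
# probability of class membership for each genre.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000, C=1.0))
clf.fit(X_train, Y_train)

probs = clf.predict_proba(X_test)                          # shape: (n_books, n_genres)
print(f1_score(Y_test, clf.predict(X_test), average=None))  # per-genre F1
```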

Test Case: Dune

Let’s see an example of genre tag prediction in action. As an input synopsis, I’ll use Dune, set in the year 10191 on the desert planet Arrakis. Here are the top five predicted tags and their associated probabilities:

‘Science-fiction’ and ‘fantasy’ are the top predicted labels, as we would hope. But the AIgent has also predicted ‘historical-fiction’. One of the most exciting early results from the AIgent was a tendency to predict genre tags that feel right — but aren’t. Dune is not historical fiction, because its story did not take place in our history. Other titles predicted to be historical fiction include The Time Traveler’s Wife and Bill and Ted’s Excellent Adventure. Why is the AIgent predicting this label? Because the AIgent has learned that stories which take place in far-flung locales, or in historical timelines, are likely to be historical fiction.

Why is this useful? A writer pitching Dune is unlikely to call it historical fiction. But a literary agent who receives this analysis will be informed that Dune belongs to a subclass of science fiction that takes place in settings reminiscent of historical fiction. That is an actionable insight and a clear gain over the status quo.

Vector Similarity of Synopses

Finally, in addition to my classifier, I needed a way to compare unknown text synopses against my database of embeddings. This is a ‘document distance’ problem, and is typically approached with cosine similarity. For a great primer on this method, check out this Erik Demaine lecture on MIT’s open courseware.
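In practice the similarity lookup is only a few lines of NumPy; the sketch below assumes the embedding database is held in memory as a single matrix.

```python
import numpy as np

def most_similar(query_vec: np.ndarray, db: np.ndarray, top_k: int = 10):
    """Return indices and scores of the top_k database rows most similar
    to query_vec, where db is an (n_books, 768) matrix of embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(-scores)[:top_k]
    return idx, scores[idx]
```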

As a means of validating this approach, I selected groups of highly-prolific authors and measured the cosine similarities between their own works and the works of other authors. To visualize the results, I performed dimensionality reduction with t-distributed stochastic neighbor embedding:

As can be seen in the above example, the works of each author are well-clustered. There are interloping blue and orange dots, but these turn out to be one-off books written outside of the author’s main genre. As an added bonus, I was excited to see that Robert Galbraith clusters with J.K. Rowling. This clustering is a great indication that the AIgent can surface similar titles.
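For reference, the dimensionality-reduction step behind that visualization can be sketched as follows; the author_embeddings and author_labels names are placeholders for the subset of embeddings belonging to the selected authors.

```python
from sklearn.manifold import TSNE

# author_embeddings: (n_books, 768) DistilBERT vectors for books by the
# selected authors; author_labels: one author name per row.
coords = TSNE(n_components=2, metric="cosine", random_state=0).fit_transform(author_embeddings)
# coords is (n_books, 2) and can be scatter-plotted, colored by author_labels,
# to check that each author's books cluster together.
```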

The AIgent at Work

Cross-Media Potential

With these two major functionalities in place — measuring cosine similarity between embedded synopses, and predicting likely metadata — I launched a beta version of the AIgent as a web application. The workflow is simple: the user enters a text synopsis into a search bar. The AIgent embeds that synopsis and measures cosine similarity across the embeddings database. My trained classifiers are then used to predict metadata. What is returned to the user is a list of similar titles and their associated metadata, as well as a list of likely genre tags. Below are some screen grabs of the AIgent at work, using the synopsis from The Expanse.

In the above example, my query synopsis is from the television version of the science-fiction series The Expanse. This is a great example of the cross-media potential of the AIgent: a television synopsis was used to identify similar books. Depending on the underlying database used for similarity matching, the AIgent can predict movies from movies, books from movies, movies from books, etc. This functionality should immediately make the AIgent useful to the buyers of filmed content, or to agents who are most interested in adapting books to film.

Popularity Bias in Content Recommendation

At its core, the AIgent is a content-based recommender system, i.e. its predictions are built on qualities of the book as opposed to qualities of the users who interact with it. Why is this important? Because many recommender systems — and human recommendations — suffer from what is called ‘popularity bias’. Popularity bias causes us to recommend not just similar items, but popular similar items. Why? Because those are the ones we are most exposed to.

This approach can be problematic for content buyers, who may wish to know not just which popular items are similar — but whether similar items are popular. If the queried synopsis occurs in a space with very poor recent sales, that doesn’t mean it’s a bad book. But it might be a bad bet financially.

The AIgent addresses this issue by including a popularity filter. Literary agents and other content buyers can choose to surface either the most similar titles, or the most similar popular titles. This will both enhance screening and strengthen a buyer’s ability to quickly evaluate an unpublished book’s market.
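One simple way to implement such a filter is to restrict the similarity search to titles above a popularity cutoff before ranking. The sketch below reuses the most_similar helper from earlier and assumes a ratings_count array as the popularity proxy; the cutoff value is purely illustrative.

```python
import numpy as np

def similar_and_popular(query_vec, db, ratings_count, min_ratings=10000, top_k=10):
    """Rank only sufficiently popular titles by cosine similarity to the query."""
    idx_popular = np.where(ratings_count >= min_ratings)[0]
    top_idx, scores = most_similar(query_vec, db[idx_popular], top_k=top_k)
    return idx_popular[top_idx], scores
```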

Fairness in Content Acquisition

The lion’s share of writers who manage to land an agent do so by pursuing a Master of Fine Arts (MFA) degree. In some regards, this makes perfect sense. As with many other fields, those who are highly-committed will undertake graduate-level coursework. And those who are most capable of handling such coursework have, in a sense, been pre-screened for some of the talents required to work at the forefront of the field.

But there are a few obvious flaws with such a feeder system. First off, it imposes a financial filter. Only those with the means to pursue graduate study actually do so. And while it is true that graduate study comes with financial hardship for just about everyone that does it — that doesn’t mean that every talented writer is equally able to pursue an MFA.

Secondly, there is a geographical filter. While publishing is increasingly global, a feeder system focused on western universities will not lead to a similarly global base of writers. And that can mean an inability to source stories from people with lived experience.

Lastly, the current system does not help publishers react to market trends — for that, you need to crowdsource and move quickly. To take one example: if there is a sudden need in the market for climate fiction set in the Maldives, the AIgent can help bubble those submissions to the top, regardless of an author’s CV. In that sense, the AIgent can help publishing become more fair and inclusive, and can open up new markets in the process. In an age when there is (rightfully) plenty of focus on making sure that our algorithms are fair, it is easy to forget that algorithms can themselves be used as an engine of fairness.

The Beta Test: My Own Novel

I am not just the architect of the AIgent; I am also its beta tester. As I mentioned above, I began to build the AIgent shortly after finishing my last novel. I love my book, and I hate the thought of it languishing in strangers’ inboxes. But rather than lament that the system is not working, I decided to try to change it. The first synopsis digested by the AIgent was the synopsis of my own book, Tantalus, a piece of speculative fiction that takes place at a remote Siberian research station.

The most similar book to mine, according to the AIgent.

What did the AIgent say? As it was designed to do, the AIgent gave me a list of similar books. I was both pleased at how similar they were, and horrified at how suddenly crowded the content space around my book seemed. The AIgent also gave me a list of the agents who represented those books.

With that knowledge in hand, I sent out a small number of personalized messages to those agents, and explained that a state-of-the-art neural net had connected us. What happened next? I could tell you, but it would ruin the story.

Ryan Dalton built The AIgent as a 4-week project during his time as an Insight Data Science Fellow in 2020.

Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.
