Clustering sentence embeddings to identify intents in short text

The unsupervised learning problem of clustering short-text messages can be turned into a constrained optimization problem to automatically tune UMAP + HDBSCAN hyperparameters. The chatintents package makes it easy to implement this tuning process.

User dialogue interactions can be a tremendous source of informati on on how to improve products or services. Understanding why people are reaching out to customer service is also an important first step in automating some or all of the replies (for example, with a chatbot). There are several ways to analyze dialogue interaction data to extract useful insights, and it is common to characterize the interactions by topics discussed, sentiment and intent.

Determining the intent is particularly useful from the perspective of wanting to improve products or services because it answers the question: why are people reaching out to begin with? However, a major hurdle to leveraging user message intent is that determining it is typically treated as a classification problem. This means you usually need to already have a significant amount of labeled data to get started. For example, Microsoft’s LUIS and Google’s Dialogflow both start with the premise that you can either use prebuilt domain labeled data or that you already have labeled data.

But what if you don’t yet have any labeled data and you don’t think any publicly-available labeled data is relevant (as is often the case)? On top of the challenge of having an unsupervised learning problem, the messages containing the intent are typically quite short (less than 15 words). I was recently tasked with this challenge, with one additional hurdle: we only had about 1,000 samples total. I immediately remembered some sage advice I had seen years ago:

After thinking about my particular problem for a bit and making a few unfruitful attempts at treating this as an unsupervised learning problem, I ultimately manually labeled the data (it took about a week…). Doing the labeling manually gave me a helpful appreciation and intuition for the data. But at the same time it made me very curious to see if there is a way to get most of the way to those labeled intents in a much more automated way. This post will provide an approach I learned that can automatically cluster short-text message data to identify and extract intents.

Before we go further, let’s first define what we’re trying to do. Here I’m interested in answering the question:

given an unlabeled set of dialogues between a user and some company representative, is there a way to obtain a helpful labeling of user intents in an automated way?

As this is an unsupervised problem and labeling intents can be quite subjective, I wouldn’t expect to be able to find a perfect solution. But, similar to how auto-EDA libraries aren’t exhaustive but can provide a helpful starting point when confronted with new data, can we do something to provide initial insights before committing to time-consuming manual labeling? It’s possible that the automated results could be good enough, saving someone from a week or more of manually labeling data. Alternatively, it could speed up the labeling process by providing a helpful starting point.

Obviously, I’m not able to share the original dataset that inspired this article, so I set out to find something as similar as I could that is publicly available. While several dialogue datasets exist that have labeled intents, a major limitation in many of them is the small number of intents represented (often ~10). Having a small number of intents, or classes, will make the problem too simple. Although the manual labeling process is inherently subjective, I found there to easily be more than 50 intents, unequally represented, for the data I was working with. That seems to be fairly common for real-world applications.

Thankfully the PolyAI team published the banking77 dataset, which contains 77 intents represented unequally:

The full dataset contains 10,0003 messages in the training dataset across 77 intents. The maximum and minimum category counts are 187 and 35, respectively.

In order to more closely match the challenge I previously faced, I’ll randomly sample only 1,000 of the 10,000 total samples from the training set:

Note that while this dataset is useful for this exercise for demonstration purposes, this is still somewhat artificial and you’ll face additional challenges in a real-world setting. For example, you’ll first need to identify what message or sentence in a full dialogue sequence actually relates to the intent, as well as handle random system error messages, typos and nonsensical messages. This post on discovering and classifying AirBnB message intents touches on some of the real-world challenges.

There are several ways to approach an unsupervised learning problem like this. Topic modeling was the first method that came to mind when confronted with this problem. It’s a technique used to discover latent topics in a collection of documents.

Many algorithms can be used to perform topic modeling, but one very common one is Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model that assumes that each document is made up of a distribution of a fixed number of topics and each topic is made up of a distribution of words. A big challenge when trying to use LDA (and many other topic modeling algorithms) is deciding how many topics to actually use, which is a necessary model hyperparameter. Obviously, if that is what we’re hoping to get out of the analysis then this is a problem. Coherence is one way to assess the quality of the learned topics by measuring how similar the words are in each topic, and a higher coherence score is better. Gensim, a very popular library for topic modeling, makes it easy to calculate model coherence. Unfortunately, for the short texts we’re working with here, it isn’t obvious how many topics to pick using coherence:

It appears as though increasing the number of topics continues to increase the coherence for this dataset, giving us little guidance as to how many topics to choose.

Adding to the challenges, topic models can be hard to interpret. For example, consider the identified topics below:

While some of the topics make sense, many are hard to interpret. This article series about work at Pew Research does a great job walking through the challenges of interpreting topic models.

Ultimately, the biggest issue is that intents are more nuanced than topics. A limitation in LDA and other topic modeling approaches is that they treat the vocabulary in the documents as a bag of words, where the order doesn’t matter. This works well for longer documents (and a larger corpus), where identifying words that co-occur can provide a good picture of the topics. Additionally, there are often a relatively small number of topics, and the topics are fairly distinct. However, short-text intents create challenges, such as two phrases having nearly identical words but very different intents or having the same intent but almost no words in common. This severely limits the usefulness of standard topic modeling approaches for identifying intents in short text.

Aside from topic modeling, clustering is another very common approach to unsupervised learning problems. In order to be able to cluster text data, we’ll need to make multiple decisions, including how to process the data and what algorithms to use.

First, it is necessary to represent our text data numerically. One approach is to create embeddings, or vector representations, of each word to use for the clustering. This article gives a good overview of various ways of embedding words. Since each message consists of several words, one option is to then simply average the individual word embeddings of all the words in each message. This has worked well enough for some applications, but it would be better to directly calculate an embedding for the full sentences to more effectively take meaning into account. Especially given how short each message is, this will help avoid some of the pitfalls of the topic modeling algorithms described above.

It turns out that there are many ways to find a single vector representation of a full message or sentence. This article gives a great overview of the various methods to achieve this. Google’s Universal Sentence Encoder (USE), first published by Cer et al in 2018, is a popular sentence embedding model. The USE model was trained on a variety of data, including Wikipedia, web news, web question-answer pages and discussion forums, and it performs well on sentence semantic similarity tasks.

In 2019, Reimers and Gurevych published a paper introducing Sentence-BERT, a “modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity”. They also released a Python implementation that makes it easy to download and use many different pre-trained models.

Given how small our dataset is, using a pre-trained model is preferable here. For this analysis, I’ll compare the results of four pre-trained sentence embedding models: USE and three different sentence-BERT models (all-mpnet-base-v2, all-MiniLM-L6-v2 and all-distilroberta-v1).

Converting our messages into sentence embeddings is then straightforward:

As seen above, all of our sentence embeddings have high dimensionality (>500 features each). A manifestation of the Curse of Dimensionality is that distance measures, such as Euclidean and Manhattan, needed for clustering become meaningless at such high dimensions (see for example “On the Surprising Behavior of Distance Metrics in High Dimensional Space” by Aggarwal et al for more details). While some of the sentence-transformer pre-trained models were created in a way to preserve the usefulness of some distance measures, dimensionality reduction before clustering will greatly improve the results. (To prove this to myself, I briefly explored using different clustering algorithms and embeddings without dimensionality reduction first.)

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP), introduced by McInnes et al in 2020, has quickly grown in popularity as a dimensionality reduction technique. UMAP is much faster and more scalable than t-SNE, while also preserving the global structure of the data much better. This makes it useful for both visualization and as a preprocessing dimensionality reduction step to use before clustering. We’ll use it here for both.

The Scikit-learn documentation has a helpful overview of the many different clustering algorithms it supports and when each performs best. For our current application, it is preferable to use an algorithm that does not require specifying the number of clusters upfront and can also tolerate noisy data. Density-based algorithms are a good option here as they do not require specifying the number of clusters and are indifferent to cluster shape. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) has become popular since it has fewer and more intuitive hyperparameters than DBSCAN and is robust to variable-density clusters. The HDBSCAN documentation provides a helpful comparison of different clustering algorithms. HDBSCAN worked best for the current problem, so we’ll focus on it for this post.

There are at least two packages (and likely many more) already available to chain UMAP and HDBSCAN together for the purposes of topic modeling: Top2Vec (github and paper) and BERTopic (github and article). However, the default hyperparameters used in both packages do not work well for problems like the current one with short text and a small corpus (most of the data ends up being classified as noise and only three clusters total are found). To make it easier to tailor to our current problem of intent extraction, we’ll instead directly use the UMAP and HDBSCAN packages for hyperparameter tuning.

UMAP has several hyperparameters that control how it performs dimensionality reduction, but two of the most important are n_neighbors and n_components. The n_neighbors parameter controls how UMAP balances local versus global structure in the data. This parameter controls the size of the neighborhood UMAP looks to learn the manifold structure, and so lower values of n_neighbors will focus more on the very local structure. The n_components parameter controls the dimensionality of the final embedded data after performing dimensionality reduction on the input data. Unfortunately, there is no obvious way to pick the best UMAP parameters by itself without ground truth labels. Here we do have the labels, which we’ll use at the end to determine how well we did. But the point of this work is to identify a methodology to use when we have unlabeled data. In Angelov’s Top2Vec paper he mentions that n_neighbors = 15 and n_components = 5 worked best for his downstream tasks, but it is unlikely this would always be the case for any dataset.

HDBSCAN also has several important hyperparameters, but the most important one to consider is min_cluster_size. Intuitively, this controls the smallest grouping you want to consider as a cluster. In addition, the min_samples parameter, which defaults to being equal to min_cluster_size if unspecified, controls how conservative the clustering is. The larger it is, the more points are discarded as noise/outliers. Decoupling the two hyperparameters and having a smaller min_samples than min_cluster_size will essentially keep points that would have been labeled as outliers by merging them with their most similar neighboring clusters. This isn’t exactly what we want to happen if we’re trying to uncover the number of clusters. Thus, here I’ll only consider directly modifying the min_cluster_size parameter:

Note that UMAP is a stochastic algorithm, using randomness to speed up approximation steps and perform the optimization. Thus, we’ll set the random seed state to a constant value to get consistent results for a given set of UMAP hyperparameters.

We now have a pipeline with three hyperparameters (n_neighbors, n_components, and min_cluster_size) that we want to tune. Next, we need to decide how to actually evaluate our clusters to select the best hyperparameters. Although commonly used with various clustering algorithms, Silhouette Score is not a good validation metric for density-based algorithms like DBSCAN and HDBSCAN since it assumes all points are assigned a group and can’t appropriately handle noise/outliers. Density Based Cluster Validation (DBCV) has been proposed and used by some for tuning HDBSCAN hyperparameters. While it likely works well for several applications, for this current problem it favored having a smaller number of clusters and leaving too many samples in the “noise” category.

Instead, we’ll leverage the useful probabilities_ HDBSCAN attribute, which from the documentation is:

The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.

This article by Nikolay Oskolkov provides a very intuitive and logical solution of simply defining our cost function that we want to minimize as:

This will help ensure that we assign as many data points as we can to actual clusters instead of labeling them as noise. But what’s stopping us from setting the hyperparameters to make every individual point a “cluster”, or just making one giant cluster?

Here we have to use some domain knowledge to apply constraints. For this problem, based on my experience with this kind of data, I expected there to be at least 30 labels, but no more than 100. So our objective function becomes a constrained optimization problem:

With the current dataset size of only 1,000 samples, it still takes about 3 seconds to generate the clusters and score them for a given set of inputs. Attempting to do a full grid search of, for example, a 10 x 10 x 10 hyperparameter search space would take almost an hour. Larger dataset sizes would take even longer. I care about finding the right hyperparameters, but not that much. Performing a random search instead of a full grid search is a pretty effective alternative strategy:

Running the search for 100 randomly-selected hyperparameter values yields:

We see that the runs with the lowest cost also have less than 10 clusters total. The first entry with a number of clusters between 30 and 100 has 59 clusters and a cost of 0.140 (i.e. about 14% of the data was labeled as outliers or low confidence). It only took 6 minutes to run too. Not bad.

Randomly searching the hyperparameter space works reasonably well, but there is a better option: Bayesian optimization. Here we’ll leverage the popular hyperopt package to do so. If you’re unfamiliar with hyperopt and Bayesian optimization, this article provides a good overview.

First, define the objective function that we want to minimize. The optimization constraints are included within the objective function by adding a penalty term if the number of clusters falls outside of the desired range:

Then minimize the objective function over the hyperparameter search space using the Tree-structured Parzen Estimator (TPE) algorithm:

Running the Bayesian search with 100 max evaluations over our parameter space yields slightly better results than random search:

It’s then easy to run the pipeline using embeddings from multiple different models:

At this point we could do a few more things, like visualize the clusters or manually inspect some of them to make sure they make sense. But ultimately we’re trying to find the “best” clustering results from the best analysis pipeline. If we trust our loss function, then it makes sense to pick the configuration with the lowest loss. Of the combinations tried above, it seems that we should go with sentence-transformer #1 (all-mpnet-base-v2), which generated 55 clusters using n_neighbors = 6, n_comonents= 9, and min_cluster_size = 6.

In this case, we happen to also know the ground truth labels so we can see how well our loss function correlates with performance. We can manually inspect how well the models did on some of the ground truth clusters:

As shown above, all models seem to do relatively well on placing most of messages in the card_about_to_expire ground-truth group in the same clusters. At least for this category, the first sentence-transformer model seems to stand out in correctly assigning all messages to the same cluster.

Rather than manually inspecting all groups, we can also quantitatively assess the model performances. Two commonly-used metrics for evaluating text clustering are the Normalized Mutual Information and Adjusted Rand Index. Both metrics have values ranging from 0 to 1, where larger is better. Calculating these metrics for the best hyperparameters of the four models under consideration yields:

In agreement with our previous conclusion, sentence-transformer #1 does in fact perform the best, with an ARI of 0.46 and an NMI of 0.81. However, the performance ordering for some of the other models do not follow the order expected from their cost function values. Thus, our scoring method for hyperparameter tuning isn’t perfect, but it is clearly still useful for the current application.

To make the results even more helpful, we can also automatically apply descriptive labels to the clusters we found. A paper by Liu et al provides an interesting approach to doing this by extracting the most common action-object pair from the phrases in each cluster as the cluster label (e.g. “book-flight”). The bank77 dataset we’re considering here is a little more complicated than the dataset in that paper, but we can do something similar. Here we’ll concatenate the most common verb, direct object, and top two nouns from each cluster. The spaCy package has a powerful syntactic dependency parser that we can use for this:

We can write a simple function to extract these labels for each cluster:

Applying these labels to each of the clusters that our best model found yields our final result:

Using our tuning method, we’ve automatically extracted and applied descriptive labels to 55 clusters in our dataset.

Since in this case we know the ground truth labels for each document, we can also inspect a few documents to see how well our derived labels match the ground truth labels:

They aren’t perfect, but the extracted labels match the ground category labels pretty well!

In this post, I outlined a framework for leveraging domain knowledge to create a constrained optimization problem to automatically tune UMAP and HDBSCAN hyperparameters. This allows us to easily cluster short-text documents and apply descriptive labels. The focus for this article was on a small dataset, but the same method can be applied to larger datasets as well. The clustering results provide helpful insights of unlabeled text data in a very short amount of time, before deciding or needing to complete time-intensive manual labeling.

All code examples from this article, along with the chatintents python package I created to make applying these concepts easier, can be found here:

Thanks for reading. If you found this article helpful, drop a comment below or connect on LinkedIn.

Data Science | Product | Healthcare & Renewable Energy |



Thanks to Elliot Gunn. 




Your home for data science. A Medium publication sharing concepts, ideas and codes.

Read More