You might be training your chatbot wrong
Or why your intent detection isn’t working and how to fix that
Published on 09-06-2023
Let's talk about NLU-powered chatbots now. Every now and then, I receive consultation requests from tech teams seeking expert help reviewing their existing chatbot implementation: "Lena, we have built a chatbot, but it's not working. What are we doing wrong?"
Chatbot development has become very accessible these days. Most chatbot development tools offer user-friendly no-code interfaces, which allow anyone, regardless of prior experience in building chatbots, to create one. We can now prototype and verify our ideas faster, which is really great.
And while it's easy to create a chatbot, building a chatbot that actually works is still hard. I often see developers without a background in Natural Language Processing and Machine Learning struggle to train chatbots that perform well. Without intuition about how Machine Learning algorithms work, it is hard to train your model properly and achieve high accuracy.
It doesn't matter how advanced your Machine Learning models are: if you don't know how to correctly label your training data, your model won't be able to accurately predict intents. The quality of your input data directly determines the quality of your output.
Unfortunately, if you labeled your data incorrectly at the start, you often need to reorganize your training data from scratch. That is why it's important to understand how to correctly label data in the early stages of your chatbot development process. This will save you time and money, because fixing it later is harder than setting it up correctly from the beginning.
I wrote this blogpost hoping to address some of the most common mistakes my clients make, so that you can avoid them. I share some of the best practices for training chatbots using tools like Rasa or DialogFlow, as well as tips to improve your intent detection. While reading this blog post won't make you an expert in chatbot development, I hope it will help you gain an intuition of how intent detection models work and train intent detection models that perform well.
If you are just starting to build an NLU-powered chatbot, I hope this blogpost will help you avoid some of the most common mistakes. And if you already have a chatbot and it's not working, I hope it will help you understand why and come up with an actionable plan for fixing and improving the quality of your intent predictions right away.
Let's get started!
Table of contents
- ✨ How to properly label training data for intent detection
- 1. Know when to use intents and when to use entities
- 2. Check that your training phrases are unique and don’t overlap across different intents
- 3. Make sure your training data doesn’t have phrases with implicit meaning
- 4. Make your training examples varied enough
- 5. Use AI generated data only to bootstrap the training process but train your chatbot on conversations from real users
- 6. Check that your training data is balanced
- 7. Experiment with confidence threshold
- 8. Know when to use domain-specific word embeddings
✨ The basics
The task of intent detection
Intent detection is the process of identifying the intention behind a user's input. Intents can be thought of as distinct categories representing different user goals. For example, "booking X" is an intent, expressed by utterances like "I want to book X".
The task of intent detection is to train a machine learning model to receive a user utterance and predict the intent from a predefined list of intents. Intent detection is essentially a text classification task, where the model aims to learn patterns or rules to differentiate between the different intent clusters.
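To make this more concrete, here is a minimal sketch of intent detection as plain text classification, using scikit-learn. The intents and phrases are made up for illustration; platforms like Rasa or DialogFlow do a more sophisticated version of this under the hood.

```python
# Minimal sketch: intent detection as text classification (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

training_phrases = [
    ("book a flight to Berlin", "book_flight"),
    ("I want to book a flight", "book_flight"),
    ("cancel my reservation", "cancel_booking"),
    ("please cancel my flight", "cancel_booking"),
]
texts, labels = zip(*training_phrases)

# Vectorise the utterances and train a simple classifier on the labelled examples.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(list(texts), list(labels))

# For a new utterance, the model predicts one intent from the predefined list,
# together with a confidence score per intent.
print(model.predict(["can you book me a flight?"]))        # likely ['book_flight']
print(model.predict_proba(["can you book me a flight?"]))  # probability per intent
```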
The role of an AI Trainer
Your goal as an AI trainer is to prepare training data in such a way that it is as clean and unambiguous as possible. You are the one teaching the intent detection model to predict intents, and you want to make it easy for a machine learning algorithm to learn from your training data.
✨ How to properly label training data for intent detection
Now, let's discuss some important principles that will help you prepare training data for your intent detection model.
1. Know when to use intents and when to use entities
The most common mistake I see when labelling data for intent detection is using intents instead of entities. Let's revisit the definitions.
📖 Theory & principles
Intents refer to the underlying intention or goal behind a user's utterance. An intent is usually an action and represents what the user is trying to achieve. For example, "searching for a flight" is an intent.
Entities are specific pieces of information within a user's utterance which are important for us. Examples of entities include names, dates, locations. In linguistic terms, entities are usually nouns.
🤔 The problem
There are essentially two main problems with using intents where you should have been using entities.
Problem № 1: If you use entities as intents, you'll get a lot of overlapping phrases across different intents which confuses your model.
Let's have a look at the following image:
In the picture, we have two intents, one for "renting a bike" and one for "renting a scooter". Now, let's say a user says "Is it possible to rent a scooter?". Which intent does it belong to? On the one hand, the question is about a scooter, so it might belong to the "renting a scooter" intent. On the other hand, it uses exactly the same words as one of the training phrases in the "renting a bike" intent. So, is it "renting a scooter" or "renting a bike"? A machine learning model would probably get confused and predict neither with a high enough confidence score.
Problem № 2: It's harder to choose 1 thing out of 100 than 1 thing out of 5.
When you use entities as intents, you quickly end up with over 100 intents. This significantly increases the complexity of the task for your machine learning model. Instead of predicting just 5 intents and extracting entities in a separate step, you are now asking the model to select one option out of 100. Consider it from a human perspective: which is easier, choosing one thing from 5 options or one thing from 100 options? The former is easier for humans, and the same principle applies to an intent detection model. The more intents the model has to predict, the more challenging the task becomes, making it harder to achieve good performance.
💡 The solution
So what should you do instead? Remember, an intent is an action, something that the user wants. What is the intent in "Is it possible to rent a scooter?"? The user wants to rent something. That is the intent. And the entity? The entity is "scooter", the thing they want to rent. They might want to rent a "scooter", but they could also want to rent a "bike" or something else. "Scooter" and "bike" are entities.
This is how you use intents and entities correctly:
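To illustrate the difference between the two labelling approaches, here is a minimal sketch. The intent and entity names are made up and don't follow any particular platform's format:

```python
# Anti-pattern: one intent per thing the user can rent.
# The training phrases overlap heavily, and the number of intents keeps growing.
intents_as_entities = {
    "rent_bike":    ["Is it possible to rent a bike?", "I want to rent a bike"],
    "rent_scooter": ["Is it possible to rent a scooter?", "I want to rent a scooter"],
    # ... one more intent for every new rentable item
}

# Better: a single intent for the action ("renting something"),
# with the rentable item captured as an entity.
intents = {
    "rent_vehicle": [
        "Is it possible to rent a scooter?",
        "I want to rent a bike",
        "Can I hire an e-bike for the weekend?",
    ],
}
entities = {"vehicle": ["bike", "scooter", "e-bike"]}
```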
2. Check that your training phrases are unique and don’t overlap across different intents
Now that you know when to use intents and when to use entities, you need to check that you are adding training phrases to your intents correctly. One common mistake I often see is including either identical or very similar sentences in different intents.
📖 Theory & principles
Our language is very complex, and as humans, we can understand nuances and subtle differences in meaning. We know that "Do you have bikes?" is a bit different than "Show me all bikes you have".
When training an intent detection model, however, we can't take into account all of the language's intricacies, because the number of intents we can train is limited. Thus, when labelling training data for an intent detection model, we come up with a simplified way to represent our language. For the intent detection model to work well, we need to follow these principles:
- 1. Training phrases within one intent should share a common meaning.
- 2. Training phrases across different intents should be as dissimilar as possible.
🤔 The problem
Deciding which intent a phrase belongs to is a challenging task, particularly because a single sentence can have multiple interpretations. When humans label data, they rely on their intuition and understanding of language, but intent detection models lack this awareness of linguistic nuances. As a result, suboptimal decisions are often made when labeling training data: groupings of phrases into intents that make sense to humans end up confusing the intent detection model and negatively impacting the chatbot's accuracy and performance.
Let's look at the following example: are those two different intents or one?
Hard to say. And that's exactly the problem. How do we decide if two phrases belong to one intent or two different intents?
💡 The solution
As humans, it's hard for us to think like a machine learning model, because we are not one. When labelling data for intent detection, it's best to use tools like HumanFirst or other data analysis techniques with machine learning algorithms under the hood to help you clean your training data.
Go through your training phrases and check two things:
Important thing №1: Check that you never use the exact same training phrase across different intents.
Important thing №2: Check that your intents don't have overlapping training phrases. If two intents are too similar, sometimes it makes sense to merge them into one bigger intent. And sometimes it's the other way around, and it makes sense to split one intent into two, making sure that all phrases within one intent are united by meaning.
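If your training data is already large, checking this by hand gets tedious. Here is a rough sketch of how you could automate both checks with a script; it uses TF-IDF cosine similarity as a crude proxy for "too similar", and the 0.8 threshold is an arbitrary starting point, not a recommendation:

```python
# Sketch: flag exact duplicates and near-duplicates across intents (illustrative data).
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

intents = {
    "rent_vehicle":  ["Is it possible to rent a scooter?", "I want to rent a bike"],
    "opening_hours": ["Is it possible to rent a scooter?", "When are you open?"],
}

# Check 1: the exact same phrase used in two different intents is always a problem.
for (name_a, phrases_a), (name_b, phrases_b) in combinations(intents.items(), 2):
    for phrase in set(phrases_a) & set(phrases_b):
        print(f"Duplicate in '{name_a}' and '{name_b}': {phrase!r}")

# Check 2: pairs of phrases from different intents that are suspiciously similar.
all_phrases = [(name, p) for name, phrases in intents.items() for p in phrases]
vectors = TfidfVectorizer().fit_transform([p for _, p in all_phrases])
similarities = cosine_similarity(vectors)
for i, j in combinations(range(len(all_phrases)), 2):
    if all_phrases[i][0] != all_phrases[j][0] and similarities[i, j] > 0.8:
        print(f"Overlap: {all_phrases[i]} ~ {all_phrases[j]} ({similarities[i, j]:.2f})")
```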
3. Make sure your training data doesn’t have phrases with implicit meaning
Another common mistake I see when training chatbots is including phrases with implicit meanings in your training data.
📖 Theory & principles
Language is complex, and a sentence can have different meanings depending on the situation. As humans, we understand the meaning by considering the context in which something was said and using our knowledge of how the world works. Intent prediction models lack this ability.
Let me give you an example. Let's say we have a voice chatbot that asks if the user is busy. The user can respond with a direct "Yes" or "No", which we can easily understand. However, the user might also say something like "I am in a supermarket now". As humans, we can understand that being in a supermarket might mean the person is busy shopping and that it might not be a good time to talk because of the noise in the supermarket. We make these assumptions based on the context and our knowledge of how the world works.
An intent detection model, however, is context independent. It is not aware of what we talked about before, and it doesn't know what being in a supermarket implies for the ability to take a call. All "I am in a supermarket now" means to an intent detection model is that the person is physically located in a supermarket.
🤔 The problem
Remember, the task of an intent detection model is to learn rules to distinguish between different intents, and it learns those rules from our training data based on the semantic meaning of sentences.
Adding phrases with implicit intent to our training data confuses the model's understanding of different intents. When we combine phrases like "I am busy", "I am in a supermarket", and "It's loud" in the same intent, we make it hard for the model to recognise the patterns. These phrases have different meanings outside of the context, and by grouping them together, we leave the model wondering what makes them similar.
💡 The solution
Here is what I suggest to do:
№1 Mindset change: When labelling your data, always keep in mind that your model is not aware of the context. Look at the meaning of your training examples outside the context, imagining that you need to understand them just by looking at the words in front of you. This will help you decide whether to keep them in the current intent or create a new one.
№2 Clean your data: Avoid mixing phrases with explicit and implicit meanings. For example, don't add "I am in a supermarket now", "I am busy" and "I can't talk" to the same intent. If two phrases sound similar based purely on the meaning of the words they consist of, then they can possibly belong to the same intent. However, if two phrases mean the same thing only in the context of the current conversation, then at least one of them likely has an implicit meaning and should not be added to the current intent. What should you do then? One option is to create a new intent for these implicit phrases, grouping phrases into one intent based on their explicit meaning. But remember that you don't have to keep all your training examples. Sometimes it's okay to remove some examples from the intent for good for the sake of keeping your training data clean. Let your model figure out the rest during inference.
№3 Leverage conversation design to your advantage: Remember, chatbot development is a collaborative process. One way to improve the quality of your predictions (that many people forget about) is by improving your conversation design. Ask your conversation designer to design questions in a way that encourages specific user responses. For instance, asking yes/no or other closed-ended questions encourages users to reply in a very specific and concise way, which is easy for the ML model to understand. Open-ended questions, on the other hand, encourage more varied user responses, making the task more challenging for the ML model. Make conversation design work together with your intent detection model, not against it.
№4 Don't be a perfectionist: It's not possible to make a machine learning model predict everything with high accuracy; there will always be room for ambiguity. And that's fine. When your model is unsure, add extra follow-up questions in your conversation design and ask the user for explicit confirmation. For example, ask, "Did I understand correctly that you can't talk right now?". Adding explicit confirmation steps can be a simpler solution than striving for perfect AI model predictions in every scenario.
4. Make your training examples varied enough
📖 Theory & principles
Your users can ask the same thing in many different ways. Your goal as an AI Trainer is to teach the model to generalise to phrases it hasn't seen before. To do that, you need to include a wide variety of examples in your training data.
🤔 The problem
If your training data contains only similar phrases, your model will get good at predicting examples similar to your training data but will be very bad at generalising to unseen examples.
Imagine, you have a "book flight" intent with the following training examples:
- Book flight
- Book a flight
- Book me a flight
- I want you to book me a flight
Now, if a user asks "I'd like to reserve a flight for tomorrow", this intent will likely be predicted with low confidence.
💡 The solution
A better training set for our "book flight" intent might look like this:
- I want to book a flight
- Can you reserve me a flight for tomorrow
- Can you help me book a flight ticket?
- Reserve flight
So, check your training data once again: is it varied enough? Get inspiration from real user conversations. Provide a diverse range of training phrases that cover the different ways users may express the same intent. Include variations in sentence structure, wording, and keyword order, and use synonyms. By doing so, you help the model generalise.
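If you want a quick sanity check, one rough heuristic is to look at how similar the phrases inside an intent are to each other. Here is a small sketch using TF-IDF cosine similarity; the phrase lists mirror the examples above, and the numbers only mean something relative to each other, not in absolute terms:

```python
# Rough heuristic: average pairwise similarity of the phrases within one intent.
# The closer to 1.0, the more repetitive (less varied) the intent's examples are.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def average_similarity(phrases):
    vectors = TfidfVectorizer().fit_transform(phrases)
    sims = cosine_similarity(vectors)
    # Average the off-diagonal entries only (ignore each phrase's similarity to itself).
    n = len(phrases)
    return (sims.sum() - n) / (n ** 2 - n)

narrow = ["Book flight", "Book a flight", "Book me a flight",
          "I want you to book me a flight"]
varied = ["I want to book a flight", "Can you reserve me a flight for tomorrow",
          "Can you help me book a flight ticket?", "Reserve flight"]

print(average_similarity(narrow))  # high: the examples are near-copies of each other
print(average_similarity(varied))  # lower: more diverse wording and structure
```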
5. Use AI generated data only to bootstrap the training process but train your chatbot on conversations from real users
📖 Theory & principles
When you are training your chatbot, you need to make sure that your training data is representative of the real conversations your users will have with it. More and more often, I see people using AI generated data to train their chatbots. But AI generated data doesn't always represent how your actual users will speak to your chatbot, which in turn hinders the ability of your intent detection model to generalise to unseen conversations.
Using AI generated data in the early stages to bootstrap the process is fine; I would do that too. But making it the primary source of your training data is a recipe for failure.
🤔 The problem
Why not use AI generated data to train your model? Let's take a quick look at the synonyms returned by ChatGPT for "I can't chat at the moment".
One of the goals when training an intent detection model is to optimise the model for how your actual users speak. And the way your users speak can vary a lot depending on your domain and your use-case. For example, a chatbot developed for teenagers will have to understand modern slang, while a chatbot developed for lawyers will have to understand a more formal tone of voice. If you want to achieve a high quality of predictions, your model has to understand the language of your users. Depending on your target audience, your actual users might say something like "Sorry can't speak" or "Sorry can't have a convo right now". With the AI generated data from the image above, the model might not generalise very well to the different ways your users can express their thoughts.
Another goal when training an intent detection model is to make the model generalise to unseen conversations. As you can see from the image, the ChatGPT-generated examples all start with "I'm", so there is no syntactic variability. They also use a very similar sentence structure, which can be summarised as "I'm" + a synonym of "now" + a synonym of "can't talk". A model trained on this AI generated data might not perform well on examples with a very different syntactic structure, as they are not covered in your training data.
Although there are ways to adjust the ChatGPT prompt and encourage it to write in a certain tone of voice, the underlying issue remains: AI generated phrases often don't reflect how your actual users talk. What happens if you train your model on hundreds of AI generated examples? Your model gets really good at recognising artificially generated data, and not so good at recognising the real thing.
By the way, the same idea applies to coming up with training data yourself. You can only imagine how your real users are going to talk, and it's always better to train your chatbot on conversations with your actual users.
💡 The solution
Instead of using AI generated data as your primary data source, I recommend the following approach. When you start developing a chatbot, begin with around 20 training phrases per intent and release the chatbot to a small group of actual users as soon as possible. Then, based on the way your actual users ask questions, add more training phrases to improve the model.
6. Check that your training data is balanced
📖 Theory & principles
One other common mistake is forgetting to check that your training data is balanced. It can happen, for example, that you have one intent with 100 training examples, while all the other intents have only around 10 training examples each.
To train a robust intent detection model, you need a balanced dataset. It's best if you have roughly the same number of training examples per intent.
🤔 The problem
Having an imbalanced dataset can lead to issues such as overfitting, where the model becomes too specialised in your largest intents and fails to generalise well to new, unseen inputs. It might get really good at predicting that one huge intent and will predict it more frequently than it should.
💡 The solution
If you have an imbalanced dataset, where one intent has significantly more examples than others, you need to fix it. This is what you can try:
Solution №1: Evaluate the necessity of excessive training examples: If you have one intent with significantly more examples than others, ask yourself whether you really need all of those training examples. Are they all different enough to add value and help your model generalise? Having multiple training phrases that differ by a single word, such as "Do you have a bike?" and "Do you have a bike available?", doesn't add much value to your model. Remove redundant examples, especially those with only minor variations in syntax or wording.
Solution №2: Add more training phrases to smaller intents: If you have intents with very few training examples, consider adding more phrases to those intents to balance the dataset. Use your conversation logs and tools like HumanFirst to do that, and make sure the examples you add are diverse enough.
Solution №3: Consider splitting a large intent into smaller ones: If a single intent contains multiple underlying intents, it might be beneficial to split it into smaller, more specific intents. This allows the model to differentiate between different variations and improves its ability to predict intent accurately.
Solution №4: Merge similar intents: On the other hand, if you have multiple small intents that are very similar in meaning, consider merging them into a bigger single intent. This will reduce redundancy and simplify the intent detection task for the model.
By balancing the training data and making sure each intent has enough varied examples for the model to learn from, you reduce the risk of overfitting and help the model generalise effectively to new user inputs.
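Before deciding which of these fixes to apply, it helps to see how imbalanced your data actually is. Here is a minimal sketch of such a check; the data structure and the 5x ratio are made up for illustration:

```python
# Sketch: quick balance check over labelled training data (intent -> list of phrases).
from collections import Counter

training_data = {
    "book_flight":    ["I want to book a flight"] * 120,  # stand-in for 120 real phrases
    "cancel_booking": ["Cancel my booking"] * 12,
    "opening_hours":  ["When are you open?"] * 9,
}

counts = Counter({intent: len(phrases) for intent, phrases in training_data.items()})
largest = max(counts.values())
for intent, count in counts.most_common():
    # Flag intents that are much smaller than the biggest one (5x is an arbitrary cutoff).
    marker = "  <- under-represented compared to the largest intent" if largest / count >= 5 else ""
    print(f"{intent:15s} {count:4d} examples{marker}")
```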
7. Experiment with confidence threshold
📖 Theory & principles
The confidence threshold determines the minimum level of confidence required for the chatbot's intent detection model to predict an intent. A well-calibrated confidence threshold helps find the right balance between being too cautious and missing valid inputs, and being overly confident and providing incorrect or misleading responses.
🤔 The problem
One common mistake people make is setting the confidence threshold too high. This means that the chatbot will only respond when it is extremely confident about its understanding of the user's input. While this might seem like a cautious approach, it can lead to the chatbot being unresponsive or providing "I don't know" responses too frequently. This happens because it may not achieve a high level of confidence for inputs that are slightly ambiguous or not covered explicitly in its training data. As a result, users may become frustrated with the chatbot's limited capability to understand and respond to their queries.
On the other hand, setting the confidence threshold too low can result in the chatbot providing incorrect or nonsensical responses. When the threshold is low, the chatbot might respond even when it is unsure about the meaning of the input. This can lead to inaccurate information being presented to the users, which will negatively impact the user experience.
💡 The solution
To fix this, you need to run a few experiments.
Don't just use the default confidence threshold. Play around with different thresholds and analyse the confidence levels at which your model predicts different intents. If your chatbot frequently responds with "I don't know", try lowering your confidence threshold. If your chatbot predicts an intent when it should have responded with "I don't know", try setting your confidence threshold higher. Adjust and fine-tune the confidence threshold until you achieve the desired outcome.
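Under the hood, the logic is roughly the following. This is only a sketch: the threshold value and the fallback message are placeholders, and the exact mechanism depends on your platform's configuration.

```python
# Sketch: applying a confidence threshold on top of the intent classifier's output.
CONFIDENCE_THRESHOLD = 0.6  # a starting point to experiment with, not a recommendation

def respond(intent_scores: dict) -> str:
    """intent_scores maps intent name -> confidence, e.g. {'book_flight': 0.82, ...}."""
    best_intent, confidence = max(intent_scores.items(), key=lambda item: item[1])
    if confidence < CONFIDENCE_THRESHOLD:
        # Too unsure: fall back instead of guessing and giving a wrong answer.
        return "Sorry, I didn't quite get that. Could you rephrase?"
    return f"predicted intent: {best_intent} ({confidence:.2f})"

print(respond({"book_flight": 0.82, "cancel_booking": 0.10}))  # confident -> answer
print(respond({"book_flight": 0.41, "cancel_booking": 0.38}))  # ambiguous -> fallback
```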
8. Know when to use domain-specific word embeddings
📖 Theory & principles
Before text is sent to an intent detection model, it needs to be converted to a numeric format. Embeddings are a way to represent textual input in a numeric format that captures semantic meaning of the sentence.
Different techniques can be used to train word embeddings, including Word2Vec, GloVe, and FastText. These methods learn to represent words as dense vectors in a high-dimensional space, where words with similar meaning are located closer together.
When your embeddings are well trained and representative of your vocabulary, projecting all of your sentence embeddings into a high-dimensional space will place sentences with similar meanings close to each other and sentences with different meanings far apart. Just like in this picture.
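To make that concrete, here is a small sketch that embeds a few sentences with a general-purpose pretrained model (via the sentence-transformers library; the model name below is just one common choice) and compares them:

```python
# Sketch: "similar sentences end up close together" in embedding space.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # a general-purpose English model

sentences = [
    "I want to book a flight",
    "Can you reserve me a plane ticket?",
    "What's the weather like today?",
]
embeddings = model.encode(sentences)

# The two booking-related sentences should be much closer to each other
# than either of them is to the weather question.
print(cosine_similarity(embeddings))
```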
Sentence embeddings are the base of your intent detection model. If you train your intent prediction model using embeddings that don't represent the meaning of your domain vocabulary accurately, the model most probably won't perform well.
🤔 The problem
In most use-cases, using the standard embeddings offered by the chatbot development platform is enough. They are trained on a large amount of data, such as Wikipedia or news articles, and are representative of general language. However, if you build a chatbot for a very specialised domain, for example a chatbot that needs to know various medical terms, you might need to use domain-specific embeddings.
If your chatbot's data contains domain-specific vocabulary that is not covered by the standard word embeddings, the embeddings may be unaware of the specific meanings associated with those terms. This can result in the chatbot struggling to understand and correctly respond to queries related to your domain. The use of general-purpose embeddings may also lead to misinterpretations, especially if the vocabulary has different meanings in general language or includes specialized property or unit names.
💡 The solution
Some chatbot development platforms, like Rasa, allow you to use your own domain-specific embeddings. By training embeddings on your own data, you can ensure that the chatbot has a better understanding of the specific vocabulary and meanings relevant to your domain. This in turn will help your intent detection model make better predictions. If you have already tried everything above, nothing works, and you think embeddings might be the problem, then perhaps you need more than what standard chatbot development platforms offer out of the box. You can have a look at what Rasa offers, for example.
Final words
I believe that understanding the basic principles of how intent detection works and how to label your training data correctly can already help you improve the quality of your intent detection models. I hope this blog post helped you reflect on your current implementation and gave you ideas about what you can do better.
Keep in mind that it's not enough to just set up the training data once. As you get more users, it's important to keep monitoring and analysing conversation logs and to update your model to adjust it to user needs where necessary. Maintaining and keeping your training data clean is just as important as setting it up correctly in the first place.
If you still have questions left or would like a more in-depth review of your existing implementation, you can always reach out to me and book a consultation, I'd be happy to help. And for tech teams without any prior NLU experience, I also offer training sessions and workshops.
Thank you for reading, and best of luck with your chatbot project! Let me know if you found this blogpost useful in the comments on LinkedIn or send me a short email. It's a great motivation for me to keep writing and sharing my knowledge.
Take care :)