A conversational AI system is made up of two basic parts: a natural language understanding (NLU) model and a text-to-speech (TTS) model. The NLU model takes in user input, parses it into intents and entities, and sends that information on so the TTS model can generate the appropriate response. An NLU model has four components: an intent classification mechanism, an intent recognition mechanism, an entity classification mechanism, and an entity matching mechanism. Intent Classification Mechanism: It determines the intent of the user input by analyzing the words in it. An intent is a set of words that expresses a specific action or request and can be used as a command to complete a task. The intent recognition mechanism helps the system identify whether there are multiple intents within an utterance.
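To make that hand-off concrete, here is a minimal sketch of the kind of structured output the NLU model could pass along; the function name, intent labels, and keyword rules are purely illustrative assumptions, not part of any particular library.

```python
# Minimal illustrative sketch: the NLU side turns raw text into an intent
# plus entities, which the response/TTS side then acts on.
def parse_utterance(utterance: str) -> dict:
    """Very naive keyword-based parse: one intent plus any entities found."""
    text = utterance.lower()
    if "restaurant" in text:
        location = "near me" if "near me" in text else None
        return {"intent": "find_restaurants", "entities": {"location": location}}
    if "rain" in text or "weather" in text:
        return {"intent": "weather", "entities": {}}
    return {"intent": "unknown", "entities": {}}

print(parse_utterance("Find me restaurants near me"))
# {'intent': 'find_restaurants', 'entities': {'location': 'near me'}}
```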

This will help us detect intents more precisely in the future. For this example we are only dealing with one intent at a time, so we will use just one classifier in our dialog flow design. Intent Recognition Mechanism: After recognizing the user’s initial intent, we translate it into text for our agent’s response using techniques such as machine learning or rule-based systems. This mechanism also matches similar intents from previous conversations with users, stored in our database, to come up with the best response for that particular context and situation.
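One hedged way to picture that “match against previous conversations” step is a plain similarity lookup. The sketch below uses Python’s difflib for the similarity score; the stored utterances and responses are made up for this tutorial.

```python
from difflib import SequenceMatcher

# Hypothetical store of previous utterances and the responses that worked for them.
previous_conversations = {
    "find me restaurants near me": "Here are some restaurants close to you.",
    "when will it rain": "Rain is expected this afternoon.",
}

def best_response(utterance: str) -> str:
    """Return the stored response whose utterance is most similar to the new one."""
    scores = {
        stored: SequenceMatcher(None, utterance.lower(), stored).ratio()
        for stored in previous_conversations
    }
    best_match = max(scores, key=scores.get)
    return previous_conversations[best_match]

print(best_response("Find restaurants near my location"))
```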

This part of the AI helps us handle complex queries rather than just simple ones like “How many days does it take to get there?” or “What is your email address?”. Asking questions like these gets much easier once you learn how approachable natural language processing is! For this example, let’s assume we are dealing only with simple requests like “Find me restaurants near me”, which allows us to reuse our intent recognition mechanism for every context. Entity Classification Mechanism: It identifies properties of the entities that the user mentions in their input and sends them to the entity matching mechanism. For example, if you ask Google “What is my current location?”, it will respond with your city name and address.
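To show what “identifying properties of the entities” might look like in practice, here is a tiny sketch based on regular expressions; the entity types and patterns are assumptions made for this example, not the behaviour of Google’s actual system.

```python
import re

# Illustrative patterns mapping surface phrases to entity types.
ENTITY_PATTERNS = {
    "location": re.compile(r"\b(near me|my current location)\b", re.I),
    "datetime": re.compile(r"\b(today|tomorrow)\b", re.I),
}

def classify_entities(utterance: str) -> dict:
    """Return every entity type whose pattern matches, with the matched text."""
    found = {}
    for entity_type, pattern in ENTITY_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            found[entity_type] = match.group(0)
    return found

print(classify_entities("What is my current location?"))
# {'location': 'my current location'}
```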

This means Google has identified an entity called “my current location” in your request. Entity Matching Mechanism: This mechanism matches entities mentioned by the user in their query against predefined entities (such as a restaurant name) stored in its database, and helps the NLU model generate an appropriate response back to the user. Text-to-Speech Model: It is responsible for converting text to speech.

For this example, we will be using Google’s Cloud Text-to-Speech API, which allows developers to generate high-quality speech output in any of the supported languages.

Training and Testing Data
Training data is used to train the system by providing it with a dataset that contains both inputs and the expected responses. The training data may come from multiple sources such as user stories, analytics, or even live interactions with your users (such as chat logs). So what exactly is testing data? Testing data is used to test the system by providing it with a dataset that contains inputs but no expected responses. We can use either live conversations or prerecorded conversations for this purpose. Depending on your use case, you can also draw on other sources of training data such as analytics or user stories.

Intent Classification Mechanism
For our intent classification mechanism we will be using an intent classifier based on Hidden Markov Models (HMMs), built with NLTK. To train this model we simply used the default HMM implementation provided by NLTK, fed with a number of common English-language intents along with two additional custom intents that we defined: “weather” and “geo”.
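NLTK does not ship a dedicated intent toolkit, but it does provide an HMM trainer (nltk.tag.hmm), so one hedged way to sketch the classifier described above is to tag every token of an utterance with its intent and take the majority tag. The tiny training set below is an assumption for illustration; real data would be far larger.

```python
from nltk.tag import hmm

# Toy utterances; every token is tagged with the utterance's intent.
# "weather" and "geo" mirror the two custom intents defined above.
TRAIN = [
    ("when will it rain", "weather"),
    ("is it going to rain tomorrow", "weather"),
    ("what city am i in", "geo"),
    ("find me restaurants near me", "geo"),
]
labeled = [[(token, intent) for token in text.split()] for text, intent in TRAIN]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(labeled)

def classify_intent(utterance: str) -> str:
    """Tag each token, then take the most common tag as the utterance's intent."""
    tags = [tag for _, tag in tagger.tag(utterance.lower().split())]
    return max(set(tags), key=tags.count)

print(classify_intent("will it rain"))  # expected: weather
```

With data this small the maximum-likelihood estimates are fragile and unseen words will confuse the tagger, so treat this purely as a shape for the pipeline rather than a benchmark.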

Each intent consists of a set of words that describe the context in which it should be recognized, in order to minimize false positives while maximizing recall. To do this, we crafted our intents based on how users typically ask these questions, using natural language and phrasing similar to “When will it rain?” and “What city am I in?”. Below is an example intent definition file with our two custom intents:

weather(am): It is going to rain.
weather(pm): It is going to rain.
geo(<location>): What city am I in?

Intent Recognition Mechanism
For our intent recognition mechanism we will be using a machine learning-based intent recognition model built on GRUs (gated recurrent units), trained using Google’s TensorFlow library.

The GRU architecture differs from a traditional Recurrent Neural Network (RNN) in that its recurrent units are gated: update and reset gates control how much past information is kept at each step, which lets it train faster and handle longer, more complex input sequences than a plain RNN. While it is possible to use other intent classifiers such as an HMM or CRF in this example, we will focus on implementing only the GRU, since it gives us flexibility for future enhancements. A sample implementation of the GRU can be found here. To train this model we used an open source dataset called OpenEmoji, which contains over 80K utterances across over 2K intents, along with ground-truth labels indicating which intent was recognized for each utterance and what response the system produced based on that intent. This lets us evaluate how accurately our system recognizes intents compared to the ground truth provided by OpenEmoji, and it gives us a baseline for future performance improvements after applying optimization techniques such as word embeddings or pretrained contextual models like BERT and ELMo.
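The exact network from the tutorial isn’t shown, so here is a minimal Keras/TensorFlow sketch of a GRU intent classifier; the vocabulary size, sequence length, embedding width, and the two intent labels are assumptions, and the training call is only indicated in a comment.

```python
import tensorflow as tf

VOCAB_SIZE = 1000   # assumed vocabulary size after tokenization
MAX_LEN = 16        # assumed maximum utterance length in tokens
NUM_INTENTS = 2     # "weather" and "geo"

# Embedding -> GRU -> softmax over intents.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(NUM_INTENTS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# Training would look roughly like:
# model.fit(padded_token_ids, intent_labels, epochs=10, validation_split=0.1)
```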

Below is a sample snippet of our training data:

geo1: Find me restaurants near me.
geo2: Find me restaurants near my current location.
geo3: How many days does it take to get there?
geo4: What timezone are you in?
weather1: When will it rain?
weather2: How many days does it take to get there?
weather3: What city am I in?

Entity Recognition Mechanism
For our entity recognition mechanism we will be using a pre-trained entity recognition model called Skip-Thought Vectors (STV), trained using Google’s TensorFlow library.

Skip-Thought Vectors let us recognize entities within the context of a user’s request and pass them to the matching mechanism along with the intent information. STV uses an encoder-decoder architecture, essentially a sentence-level analogue of the skip-gram idea: it encodes the utterance, scores the candidate entity types (such as “weather” and “geo”) by likelihood, and returns the most probable one.
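Pretrained Skip-Thought weights aren’t bundled with this tutorial, so the sketch below stands in for the sentence encoder with a simple TF-IDF vectorizer from scikit-learn: encode a handful of labeled example sentences, encode the new utterance, and return the label of the most similar example. The examples and labels are assumptions; a real STV encoder would replace the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Labeled example sentences standing in for the encoder's reference set.
EXAMPLES = [
    ("when will it rain tomorrow", "weather"),
    ("is it going to rain", "weather"),
    ("what city am i in", "geo"),
    ("find me restaurants near me", "geo"),
]
sentences = [text for text, _ in EXAMPLES]
labels = [label for _, label in EXAMPLES]

vectorizer = TfidfVectorizer()
reference = vectorizer.fit_transform(sentences)

def entity_type(utterance: str) -> str:
    """Return the label of the most similar reference sentence."""
    scores = cosine_similarity(vectorizer.transform([utterance.lower()]), reference)[0]
    return labels[scores.argmax()]

print(entity_type("When will it rain tomorrow?"))  # expected: weather
```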

We use this model in our intent classifier as well, but we only need a single-sentence input for each intent, which is enough to recognize what type of entity the user is mentioning. For example, if you say “When will it rain tomorrow?” you are referring to “weather”; if you ask “What city am I in?” you are referring to “geo”. Below is an example snippet from our training data:

geo1: What city am I in?
geo2: How many days does it take to get there?
geo3: When will it rain?
weather1: When will it rain tomorrow?
weather2: How many days does it take to get there?
weather3: What city am I in?

Entity Matching Mechanism
For entity matching, we decided to go with a simple rule-based system for now, since this is just an introductory tutorial, but we can easily upgrade it later with machine learning approaches such as sequence-to-sequence or attention-based models to further improve its accuracy.
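Before looking at the data format (shown in the next paragraph), here is a hedged sketch of the lookup step such a rule-based matcher performs; the dictionary below mirrors the example weather/geo dataset, and a full finite state machine would add states and transitions on top of this.

```python
# Hypothetical entity database mirroring the example dataset below.
ENTITY_DB = {
    "weather": {
        "am": ["rainy", "cloudy", "sunny"],
        "pm": ["rainy", "cloudy", "sunny"],
    },
    "geo": {
        "current_location": {"name": "<address>", "country": "<country_code>"},
    },
}

def match_entity(intent: str, entity_value: str):
    """Look up the user's entity under its intent; None means no match."""
    return ENTITY_DB.get(intent, {}).get(entity_value)

print(match_entity("weather", "am"))  # ['rainy', 'cloudy', 'sunny']
print(match_entity("geo", "mars"))    # None
```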

We used a simple rule-based system, a finite state machine (FSM), to generate an appropriate response by matching the entities mentioned by the user in their request against the predefined entities stored in our database. Here is an example dataset of what we will be using for this example:

// Weather entity
"weather": {
  "am": { "Weather today in AM": ["rainy", "cloudy", "sunny"] },
  "pm": { "Weather today in PM": ["rainy", "cloudy", "sunny"] }
}

// Geo entity
"geo": {
  <loc>: { name: <address>, country: <country_code> }
}

Text-to-Speech Model
We will be using Google’s Cloud Text-to-Speech API to generate speech output.

This allows developers to create high-quality speech output in any of the supported languages and voices. We simply make a request with the response text as input, and the API returns synthesized audio that can be played back through a device such as a Google Home or your mobile phone.
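As a rough sketch of what that request looks like with the official google-cloud-texttospeech Python client (this assumes the package is installed and Google Cloud credentials are configured; the voice and encoding choices are just examples):

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# The text to speak would normally be the response produced by the NLU model.
synthesis_input = texttospeech.SynthesisInput(text="Here are some restaurants near you.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# Write the synthesized audio so a device or app can play it back.
with open("response.mp3", "wb") as out:
    out.write(response.audio_content)
```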


Marco Lopes

Excessive Crafter of Things
