Conversation is the primary way people interact with one another. People depend on each other so heavily that they must communicate for nearly everything, whether with a grocery vendor or at an airport ticket booth. In the social media era, many of us prefer texts and voice messages to face-to-face conversation. Because we use these media every day, we feel comfortable with them.
Communication is not limited to text and speech. It spans a wide range of modes: hand gestures, facial expressions, handwriting, eye movements, pictures, videos, and other visual, aural, and spatial cues. Interaction that draws on several of these modes is called multimodal interaction.
Conversational AI can be improved with multimodal interaction. Many large firms already use chatbots for day-to-day customer service because it makes their work easier. If users could talk to a machine using gestures or other modes of conversation, that would be a genuine milestone for AI.
In his research paper, Popescu claims that in conversational AI the user generally addresses the agent in a single mode, either text or speech, and receives output in the same mode. The system should anticipate the user's needs and deliver appropriate responses. However, when it delivers so much content in a single mode that the user cannot absorb any of it, the response becomes futile. So, to serve the purpose of simulating human conversation, agents should have multimodal capabilities.
The modalities used are primarily visual, auditory, and spatial. The number and quality of these modalities, and the interaction between them, are crucial to the simulation's realism and ultimately to its usefulness.
People naturally interact with the world multimodally. Given the growing interest in multimodal system design, various surveys have explored how users combine different modalities when communicating with such systems. Examples of these modalities are:
Typed command language.
Typed natural language.
Spoken natural language.
During an interaction between a computer and a user, chunks of data are transmitted across several modalities from the user to the computer and vice versa. Related pieces of information can be grouped into higher-level entities. Hence, multimodality can be defined as "the hand in hand cooperation between several modalities to improve user-computer interaction."
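To make the idea of grouping related chunks more concrete, here is a minimal Python sketch. It assumes that chunks arriving close together in time belong to the same higher-level entity. That criterion, along with the names Chunk, group_chunks, and max_gap, is an illustrative assumption and is not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One piece of data received on a single modality (hypothetical type)."""
    modality: str     # e.g. "speech" or "gesture"
    payload: str      # the recognized content
    timestamp: float  # arrival time in seconds


def group_chunks(chunks, max_gap=1.0):
    """Group chunks that arrive within max_gap seconds of the previous chunk.

    Temporal proximity is only one possible grouping rule; it is assumed
    here for illustration.
    """
    groups = []
    for chunk in sorted(chunks, key=lambda c: c.timestamp):
        if groups and chunk.timestamp - groups[-1][-1].timestamp <= max_gap:
            groups[-1].append(chunk)  # close in time: same entity
        else:
            groups.append([chunk])    # large gap: start a new entity
    return groups


chunks = [
    Chunk("speech", "put that", 0.0),
    Chunk("gesture", "point:lamp", 0.4),
    Chunk("speech", "there", 3.0),
]
groups = group_chunks(chunks)
```

In this example the spoken phrase and the pointing gesture fall into one group, while the later utterance starts a second group.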
Architecture, Design, and Modelling:
The simplest form of conversational AI, the chatbot, takes text as input and produces text as output. It can maintain the context of the conversation. A chatbot usually consists of the following components:
A multimodal interaction system consists of an interaction manager, a fusion component, and a fission component. A modality is any component that supports interaction.
Fusion is mainly responsible for gathering information: it collects commands from the various modalities and aggregates them. Fission does the opposite: it takes a single event and generates responses across the various output modalities.
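The roles of fusion, fission, and the interaction manager can be sketched in a few lines of Python. This is a simplified illustration, not a reference design: the names ModalityInput, fuse, fission, and handle, the slot-based command format, and the "first modality to fill a slot wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ModalityInput:
    """A partial command received from one input modality (hypothetical type)."""
    modality: str  # e.g. "speech" or "gesture"
    content: dict  # the slots this modality contributes


def fuse(inputs):
    """Fusion: aggregate partial commands from several modalities into one."""
    command = {}
    for item in inputs:
        for key, value in item.content.items():
            # Assumed conflict rule: the first modality to fill a slot wins.
            command.setdefault(key, value)
    return command


def fission(event, output_modalities):
    """Fission: turn a single event into one output per requested modality."""
    renderers = {
        "text": lambda e: e["message"],
        "speech": lambda e: f"[spoken] {e['message']}",
    }
    return {m: renderers[m](event) for m in output_modalities if m in renderers}


def handle(inputs, output_modalities):
    """Interaction manager: fuse the inputs, build a response event, fission it."""
    command = fuse(inputs)
    message = f"OK, {command.get('action', '')} {command.get('target', '')}".strip()
    return fission({"message": message}, output_modalities)


inputs = [
    ModalityInput("speech", {"action": "move"}),
    ModalityInput("gesture", {"target": "lamp"}),
]
outputs = handle(inputs, ["text", "speech"])
```

Here the spoken word supplies the action and the gesture supplies the target. Fusion merges them into a single command, and fission renders the one response event for both the text and the speech channel.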