Most interactions with AI happen through a text-driven interface. Users craft a prompt, and the AI delivers a response. That might be text, an image, audio, or even some form of video output. But there is the problem: it is output, not input. AI scans images, video, and audio, but not as humans do. It is a single-modal interface, while humans are multimodal.
To understand how to get to multimodal AI, Enterprise Times editor Ian Murphy talked with Duncan Curtis, SVP for Gen AI and AI Product and Technology at Sama. Curtis has a history of working with autonomous vehicles and AI. That gives him an understanding of what multimodality is and what it offers.
Curtis commented, “Why do you need multiple modalities? What I go back to is as humans, we have many modalities available to us. We have our eyes, our ears, our sense of touch, a sense of smell, and our hearing. All of those go together to give us a multimodal experience that makes it much easier for us to make decisions in the real world.”
Many in AI believe that a single modality is fine because the AI can record each stream of information separately and then combine that data within its engine. However, in doing so, there is a risk that it loses the context between the streams.
That context becomes important with autonomous vehicles. A human seeing a child run along a pavement would see the child running, notice them looking across the road at a friend, and assume they are about to run out into traffic. A human can make that assessment, but an AI is unlikely to. This is one area where Curtis sees a need for improvement.
This was a wide-ranging discussion that touched on multiple issues with AI and the use of multimodal AI.
To hear what Curtis talked about, listen to the podcast.
Where can I get it?
You can listen to the podcast by clicking on the player below. Alternatively, click on any of the podcast services below to go to the Enterprise Times podcast page.