Sama has launched Sama Multimodal. It allows AI teams to capture multiple types of content, including images, video, text, audio and LiDAR, to create more accurate AI systems. The goal is to improve model accuracy across a range of industries, such as automotive and retail. The company claims that initial results with one large retailer showed a 35% increase in model accuracy and a 10% reduction in product returns.

Duncan Curtis, SVP of AI Product and Technology at Sama, said, “With Sama Multimodal, organizations can build differentiated AI solutions using the full spectrum of data available, including sensor data, which is growing ever more prolific.
“What makes our platform truly unique is its flexibility—teams can ingest, align, and annotate any combination of modalities, then transition from pre-trained to proprietary models at the right moment in their development workflow. It’s designed to evolve with AI itself.”
Multimodal captures more than just more data
When we view the world around us, we do so in far more than text. The human senses give us many ways to understand where we are and what we see, hear, taste and feel. For AI, capturing that richness has been complex; it lacks the sensory inputs needed to understand and evaluate the world.
The stereoscopic vision of the human eyes, together with the rods and cones inside them, allows us to identify objects and build a 3D view that tells us whether they pass in front of or behind other objects.
Compare that to security cameras and systems. At night, it can be difficult to distinguish objects and judge depth of field. As a result, many security systems fail to understand when an intruder has passed behind or in front of an object.
The same problem is true for autonomous cars. They need to understand distance and depth of field, and determine what the objects they are observing actually are. While LiDAR has delivered a usable solution, the onboard AI in truly autonomous vehicles still needs greater fidelity in the data it gathers.
Consider also the challenge a retailer faces on an automated packing line. It is not as simple as scanning a bar code and having robots place objects into boxes. To speed up and fully automate that packing line, the system has to detect that it has the right object. If that object is fresh food, it has to determine whether the food is fit for consumption and free of disease, and then work out how to pack it appropriately.
These are just some of the things that Sama Multimodal is designed to resolve.
How does Sama Multimodal work?
Sama Multimodal is a framework that works with the company’s Agentic Capture framework. Both sit on top of Sama’s extensible data labelling platform, which delivers consistency in identifying objects.
Those objects are identified from the different types of input that Sama captures and are then validated by humans. This Human-in-the-Loop (HitL) step ensures that the objects the AI learns from are accurately verified. It also provides a quality feedback mechanism to continually tune the AI recognition engine.
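Sama has not published the internals of that workflow, but the general human-in-the-loop pattern it describes can be sketched in a few lines of Python. The `Sample` class and the `model` and `review` callables below are purely illustrative assumptions, not part of Sama's actual platform or API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    """A single multimodal sample: any mix of image, video, text, audio or LiDAR."""
    data: dict                         # e.g. {"image": ..., "lidar": ..., "text": ...}
    prediction: str | None = None      # label proposed by the model
    verified_label: str | None = None  # label confirmed or corrected by a human

def human_in_the_loop(samples: list[Sample],
                      model: Callable[[dict], str],
                      review: Callable[[Sample], str]) -> list[Sample]:
    """Run the model, send each prediction to a human reviewer,
    and keep the corrected samples as fresh training data."""
    corrections = []
    for sample in samples:
        sample.prediction = model(sample.data)    # automated annotation
        sample.verified_label = review(sample)    # human verification / correction
        if sample.verified_label != sample.prediction:
            # disagreement is the quality-feedback signal used to retune the model
            corrections.append(sample)
    return corrections
```

The point of the sketch is the feedback signal: every human correction that disagrees with the model becomes data for the next tuning pass, which is what "continually tune the AI recognition engine" amounts to in practice.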
In addition to improving quality, Sama says that this will provide a mechanism for removing bias in the AI. That will be especially important in any system that recognises humans, such as security or automotive.
The framework comes with its own set of pre-built components to help developers build applications. Importantly, the framework is also open. That allows developers to build and add their own components. It also means they can plug in any AI model they want. This prevents lock-in and opens up a wider market for application developers using the Sama tools.
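Sama has not documented the plug-in interface publicly, but an open component model of this kind usually boils down to a simple registry: developers register their own annotators or models against a modality, and the pipeline looks them up at run time. The names below are a hypothetical illustration of that pattern, not Sama's API.

```python
from typing import Protocol

class Annotator(Protocol):
    """Anything that can turn raw input for one modality into annotations."""
    def annotate(self, payload: bytes) -> dict: ...

# registry of pluggable components, keyed by modality ("image", "lidar", "audio", ...)
REGISTRY: dict[str, Annotator] = {}

def register(modality: str, annotator: Annotator) -> None:
    """Plug a custom or third-party component (or model wrapper) into the pipeline."""
    REGISTRY[modality] = annotator

def annotate_sample(sample: dict[str, bytes]) -> dict[str, dict]:
    """Route each modality in the sample to whichever component was registered for it."""
    return {modality: REGISTRY[modality].annotate(payload)
            for modality, payload in sample.items()
            if modality in REGISTRY}
```

Because components are resolved by lookup rather than hard-wired, swapping a pre-trained model for a proprietary one is a registration change rather than a rewrite, which is the flexibility the announcement emphasises.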
Initially, Sama is targeting two industries with Sama Multimodal. In the release, it says, “In retail applications, for example, Sama’s multimodal capabilities significantly improve search relevance applications and product discovery with a combination of image, text, and video annotations.
“In automotive, Sama Multimodal excels at integrating camera, LiDAR, and radar data to create more comprehensive environmental understanding for advanced driver assistance systems and autonomous vehicles.”
Enterprise Times: What does this mean?
There is a rush by AI vendors to add support for multimodal content to their AI engines. Most are targeting specific industries and tuning their solutions accordingly. Sama, however, has taken the approach of building multimodal into multiple parts of its product line.
Earlier this year, it released its Agentic Capture solution, which it calls a feedback framework for multimodal agentic AI. Now, with Sama Multimodal, it is delivering a tool for developers to build applications that support any type of input. Bring the two together, add in HitL, and you have the ability not only to capture multimodal input but also to build fully automated systems that use it.
It will be interesting to see where Sama goes next and how quickly its multimodal products are adopted. While it currently has two industries in its sights, there are many more it already supports. Will we see additional components that plug into the frameworks for those industries? Alternatively, will we see specialised Sama models for each industry?
Enterprise Times is due to record a podcast with Duncan Curtis to talk about multimodal AI. We will be asking him what the future holds.