AI – if you had a word cloud of the most common words among people who work in and follow the tech industry, those two letters would dwarf the rest. Despite the excitement around the technology, assessing its impact remains challenging for tech leaders like CIOs. Gong research found there is still little consensus on how to define the success of AI projects.
While over half (53%) prioritise productivity gains, the same proportion focus on revenue growth as a key success metric, trailed closely by worker satisfaction (46%). This illustrates confusion about where AI solutions deliver the most business value and how to evaluate it.
Measurement matters
As enterprise AI matures, corporate decision-makers are becoming more intent on proving the value of teams’ investments in the technology. And herein lies the challenge. Without consensus on what to measure or how to evaluate AI’s outputs, it’s impossible to determine whether an AI solution is truly delivering value or meeting its intended goals.
Tech leaders need to show how their AI projects serve the specific use cases they are designed for in order to justify the value of their investments, with spending on such projects expected to surpass £400 billion in 2024. For example, a large technology company using AI to forecast sales needs to prioritise accuracy over speed. Meanwhile, a retailer using generative AI to create marketing material is better off capturing the sentiment that aligns with customers’ pain points than perfectly nailing the brand’s tone.
Applying measurement to know what’s working and what isn’t
As it stands, there is no industry standard for comparing LLMs. What we can do is measure their performance relative to previous iterations or to comparable alternative solutions. Ultimately, that matters more than comparing two underlying models in the abstract: what an application actually does when faced with a prompt and input data.
At Gong, we use a system known as Elo to do this. Originally developed for ranking competitive chess players, it can also serve as the basis for evaluating and comparing the relative performance of our generative AI applications. This approach is also taken by LMSYS, an AI research organisation whose ratings many consider the de facto standard for foundation LLMs.
In chess, Elo puts a number against a player’s skill relative to another. When used to evaluate AI applications, it assigns ‘scores’ to outputs from competing versions of an application. When we pit an older version of an application against an updated one based on a different underlying LLM, their ratings are based on the outcomes each produces, so we can clearly see which version is performing better. In practice, it follows these steps:
- Set up the comparison: We create a scenario in which the old and new versions of a generative AI application, built on different underlying models, are tested under similar conditions to minimise external variables. Outcomes are then categorised as ‘old win’, ‘new win’ or ‘tie’, to show whether the new model or application is really outperforming its predecessor.
- Calculate win rate: Each ‘new win’ scores a point; each ‘tie’ scores half a point. The total points are then divided by the total number of possible points to get the win rate.
- Calculate the Elo difference: Using this win rate, we calculate the Elo difference between the old and new application, as shown in the sketch after this list. We can then evaluate how the underlying LLM – the primary variable – impacts the value the application is able to deliver to customers.
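To make the arithmetic concrete, the sketch below shows how a batch of judged comparisons can be turned into a win rate and then an Elo difference using the standard logistic Elo model (the 400-point scale also used in chess). It is a minimal illustration rather than our production code, and the function names and example counts are assumptions for the sake of the example.

```python
import math

def win_rate(new_wins: int, old_wins: int, ties: int) -> float:
    """Win rate of the new version: each new win counts 1 point, each tie 0.5."""
    total = new_wins + old_wins + ties
    return (new_wins + 0.5 * ties) / total

def elo_difference(rate: float) -> float:
    """Convert a win rate into an Elo gap using the standard logistic model,
    where a 400-point advantage corresponds to roughly a 91% expected score."""
    # Guard against degenerate rates, which would map to an infinite Elo gap.
    if not 0.0 < rate < 1.0:
        raise ValueError("win rate must be strictly between 0 and 1")
    return 400.0 * math.log10(rate / (1.0 - rate))

# Example: out of 200 side-by-side comparisons, the new version wins 120,
# the old version wins 50, and 30 are judged a tie.
rate = win_rate(new_wins=120, old_wins=50, ties=30)   # 0.675
print(f"win rate: {rate:.3f}, Elo difference: {elo_difference(rate):+.0f}")
```

Under this model, a positive Elo difference means the new version wins more often than it loses; a gap of around 100 points corresponds to the new version winning roughly 64% of comparisons.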
Ultimately, this allows us to ensure we’re building and shipping the best possible products to enable our customers to optimise their revenue organisations, and to communicate to those customers exactly how we know this.
Putting measurement at the core of the development cycle
For businesses to maximise the value they get from their AI investments, measurement must be embedded as an ongoing part of their teams’ development workflows to ensure continuous improvement. Leaders can support developers and product teams in this process by enabling them to establish a baseline set of test cases, removing the need to source new datasets for every update (a sketch of what such a baseline suite might look like follows below). As best practice, benchmarks should be published for customers in the interest of transparency and to maintain an ongoing dialogue around service improvement.
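By way of illustration, here is one hypothetical shape a reusable baseline suite could take: a fixed set of scenarios that every new version is run against, with a judging step that feeds directly into the win-rate and Elo calculation above. The TestCase fields, the judge callback and the example prompts are assumptions for illustration, not a description of Gong’s actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Outcome = Literal["old win", "new win", "tie"]

@dataclass
class TestCase:
    """A fixed, reusable evaluation scenario: the prompt and its input data."""
    prompt: str
    input_data: str

# A baseline suite defined once and reused for every model or prompt update,
# so each release is compared against the same scenarios.
BASELINE_SUITE = [
    TestCase(prompt="Summarise the customer's objections", input_data="<call transcript>"),
    TestCase(prompt="Draft a follow-up email", input_data="<deal context>"),
]

def run_comparison(
    old_version: Callable[[TestCase], str],
    new_version: Callable[[TestCase], str],
    judge: Callable[[TestCase, str, str], Outcome],
) -> dict[Outcome, int]:
    """Run both application versions over the baseline suite and tally the
    judged outcomes, ready to feed into the win-rate calculation above."""
    tally: dict[Outcome, int] = {"old win": 0, "new win": 0, "tie": 0}
    for case in BASELINE_SUITE:
        outcome = judge(case, old_version(case), new_version(case))
        tally[outcome] += 1
    return tally
```

Because the suite stays fixed between releases, any shift in the tally reflects the change being tested rather than a change in the evaluation data.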
Developing and integrating AI applications that continue to drive value is an ongoing process. As enterprise AI use cases mature and we deploy them in new ways, the tools we have to evaluate them become increasingly important to ensure resources aren’t wasted. The billions being invested reflect the potential CIOs see in AI. However, the only way to know whether that potential is being realised is by measuring AI’s business impact and consistently looking to improve it.
Gong empowers everyone in revenue teams to improve productivity, increase predictability, and drive revenue growth by deeply understanding customers and business trends, driving impactful decisions and actions. The Gong Revenue AI Platform captures and contextualises customer interactions, surfaces insights and predictions, and powers actions and workflows that are essential for business success. More than 4,000 companies around the world rely on Gong to unlock their revenue potential.