Contributing writer

Expectations vs. reality: A real-world check on generative AI

Feature
May 01, 2024 | 11 mins
Artificial Intelligence, Development Tools, Emerging Technology

Now with the benefit of hindsight, organizations are more aware of the need to move cautiously to ensure gen AI delivers rather than disappoints.

Is generative AI so important that you need to buy customized keyboards or hire a new chief AI officer, or is all the inflated excitement and investment not yet generating much in the way of returns for organizations?

Gen AI takes us from single-use models of machine learning (ML) to AI tools that promise to be a platform with uses in many areas, but you still need to validate they’re appropriate for the problems you want solved, and that your users know how to use gen AI effectively.

For every optimistic forecast, there’s a caveat against a rush to launch. Multiple studies suggest high numbers of people regularly use gen AI tools for both personal and work use, with 98% of the Fortune 1000 experimenting with gen AI, according to a recent PagerDuty study. But now organizations appear to be taking a more cautious approach when it comes to official deployments.

For example, a quarter of IT decision-makers in Foundry’s 2023 AI Priorities Study are piloting gen AI technologies, but only 20% have moved on to deployment. Senior leaders in CCS Insight’s Employee Technology and Workplace Transformation Survey gave similar responses: by the end of 2023, 18% had already deployed gen AI to their full workforce, and 22% were ready to deploy. “People want to see it be real this year,” says Bola Rotibi, chief of enterprise research at CCS Insight. But talking to IT teams like the AI professionals in Intel’s 2023 ML Insider survey suggests only 10% of organizations put gen AI solutions into production in 2023.

Ready to roll

It would be quicker to list the organizations that haven’t announced their gen AI investments, pilots, and plans than those that have, but relatively few are talking about the specifics of any productivity gains or ROI. That may be as much about protecting any competitive advantage as it is about any lack of success.

For example, many of the Google customers that talked about building with its Gemini gen AI tools at the recent Google Cloud Next conference, like Goldman Sachs, IHG, and Mercedes-Benz, turned out to still be at the pilot stage rather than in deployment.

Pilots can offer value beyond just experimentation, of course. McKinsey reports that industrial design teams using LLM-powered summaries of user research and AI-generated images for ideation and experimentation sometimes see a reduction upward of 70% in product development cycle times. But it also emphasizes that those design teams need to do significant evaluation and manipulation of gen AI output to come up with a product that’s realistic and can actually be manufactured, and the recommendation is still to set policies, educate employees, and run pilot schemes. Similarly, Estée Lauder sees value from pilots like an internal chatbot trained on customer insights, behavioral research, and market trends to make those analytics more broadly available in the business, but is still working on how to actually deliver that value.

When it comes to dividing gen AI tools into task and role-specific vertical applications, or more general tools that can be broadly useful to knowledge workers, organizations seem able to adopt the latter more quickly.

As expected, Microsoft claims its own staff gets significant value from the gen AI tools it has in market, like Copilot for Microsoft 365. “Our best users are saving over 10 hours a month,” says Jared Spataro, CVP, modern work and business applications at Microsoft, and 70% of Copilot users say it makes them more productive, working up to a third faster.

Customers like Telstra report similar time savings for their early adopters, although Forrester lead analyst on Copilot for Microsoft 365 JP Gownder suggests five hours a month is a more common gain. The other question is how well that will scale across the organization. Large Japanese advertising agency Dentsu, for instance, is very enthusiastic about Copilot for Microsoft 365, claiming staff save up to 30 minutes a day on tasks.

Adoption of Copilot so far tends to be in what Gownder refers to as pockets, which matches McKinsey’s finding that most gen AI deployments are happening in specific departments: marketing and sales, service and support, and product development.

Telcos surveyed by McKinsey demonstrated the same blend of optimism and restraint as other industries, with a majority claiming to have cut costs with gen AI, and seen increases in call center agent productivity and improvement in marketing conversion rates with personalized content — both with models deployed in weeks rather than months. On the other hand, the impact has been low outside customer service or mapping network infrastructure.

Organic growth

Some of Microsoft’s original test customers have already moved from pilot to broad deployment. One of the earliest Microsoft 365 Copilot trials was at global law firm Clifford Chance, and the company is now deploying it to the entire workforce, alongside its custom AI tool, Clifford Chance Assist, built on Azure OpenAI. The company is careful to note that any legal output from gen AI is clearly labelled and checked by a qualified lawyer but, again, the main benefits are productivity gains for knowledge workers: live transcripts, meeting summaries, and both implicit commitments and agreed-on tasks from those meetings.

“This is an incredible technology that can raise productivity, save time, and be a great human assistant,” says Gownder. “But it’s different from the tools we’ve been releasing over the last 40 years in computing. It has these characteristics you need to learn about to be truly successful.”

He offers a string of questions to assess the AI quotient of your organization:

  • Do you have a basic understanding of how AI and prompt engineering work?
  • Have you had training?
  • Do you feel confident about being able to learn these things?
  • Are you motivated to get involved?
  • Are you aware of what can go wrong and how you can be an ethical user of these things?

Another issue is getting staff to make gen AI tools part of their workflow. “Some people are really bullish on Copilot and say they’re having a great experience with it,” adds Gownder. Others find bumps in the road, though, where half of users see productivity gains and the other half doesn’t use the tools. Frequently, that’s because enterprises are underinvesting in training by an order of magnitude.

Almost every major company evaluating Copilot for Microsoft 365 is only planning on an hour of training for staff instead of the 10 hours Gownder suggests. “This is a core skill and you need to invest in training here because otherwise it’s going to bite you,” he says. That’s key both for gen AI deployments to succeed, and to get the most out of the gen AI features and natural language interfaces that’ll become common in commercial software, from Photoshop to Zoom.

Very specific successes

There are gen AI success stories in verticals like document engineering, where Docugami offers custom small language models that build a knowledge graph from a customer’s own complex documents, and can be used for both document generation and to extract data.

Commercial insurance is a vertical that Docugami CEO Jean Paoli says has been an early adopter, working with documents including statements of value, certificates of insurance, and policy documents with renewal dates, penalties, and liabilities. That’s critical information describing the risk of both individual customers and the entire portfolio, which has been difficult to manually extract and consolidate to use for generating new quotes, or representing the portfolio to reinsurers. “These are real scenarios that save you millions of dollars, not few hundred bucks,” Paoli says.

Like everyone else, large Docugami customers created gen AI committees and started pilots in 2023, but many have already moved from discovery to implementation, starting production deployments at least six months ago and seeing real returns, chief business officer Alan Yates says. In life sciences, one customer uses the platform for clinical trial documentation, compliance, and data exploration. “It took them six months to do this work previously and now it takes them a week,” he says.

Coding is another vertical where adoption of gen AI in production is increasingly common, whether that’s GitHub Copilot, Google’s new Gemini Code Assist, AWS CodeWhisperer, or tools like ChatGPT that aren’t developer specific.

Productivity improvements can be much lower initially, though. When Cisco first rolled out GitHub Copilot to 6,000 developers, they only accepted the generated code 19% of the time. Now nearly half of code suggestions are accepted. Saving just six minutes of developer time a month is enough to cover the cost, according to Redfin, although there are other metrics like code quality that organizations will want to track as well.

But the gen AI gains can also be much higher for low code platforms where citizen developers with less expertise get more benefit from the assistance. Digital insurance agency Nsure.com was already using Power Automate extensively, but describing an automation flow in natural language is much faster than even a drag and drop interface. Workflows that took four hours to create and configure take closer to 40 minutes with Copilot for Power Automate, an improvement of over 80%.

Then there’s Microsoft customer PG&E, which built an IT helpdesk chatbot called Peggy with the low code Copilot Studio gen AI tool in Power Platform that handles 25 to 40% of employee requests, saving over $1.1 million annually, principal program manager for Microsoft Copilot AI Noa Ghersin says. And having Peggy walk employees through unlocking their access to SAP saves the helpdesk team 840 hours a year alone.

Organizations that have already adopted Power Platform for low code and RPA find they can make that automation more powerful using Copilot Studio to orchestrate processes where there are multiple workflows to choose from, like ticket refunds for Cineplex. Agents used to spend five to 15 minutes processing a refund even with automation, and now that’s 30 to 60 seconds.

Counting the cost

Set monthly subscriptions can seem expensive, but it’s hard to accurately estimate costs for on-demand gen AI tools, which may gate some deployments. The costs for individual gen AI tasks can be pennies, but even small costs add up.

“Cost is a primary thing you have to take into account in gen AI, whether you go to third-party vendors or even internally,” says LinkedIn principal staff software engineer Juan Bottaro. His team recently rolled out a new gen AI feature for premium users that uses your profile to suggest if you’re a good match for a job posting, and what skills or qualifications might improve your chances.

“There were several times where we would’ve liked to move much faster because we felt the experience was a lot more mature, but we had to wait because we just didn’t have enough capacity and GPUs available,” he says.

It’s hard to predict costs for novel workflows, and any assumptions you make about usage will probably be wrong because the way that people interact with this is very different, he adds. Instead, deploy to a small percentage of users and extrapolate from their behavior.
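As a rough illustration of that extrapolation, the sketch below scales the cost observed in a small pilot cohort up to a wider rollout. Every number and name in it (`PilotUsage`, `extrapolate_monthly_cost`, the cohort size, the dollar figures) is a hypothetical placeholder rather than anything reported by LinkedIn, and it assumes the wider population behaves like the pilot cohort, which may not hold.

```python
from dataclasses import dataclass


@dataclass
class PilotUsage:
    """Usage observed from a small pilot cohort (all figures are hypothetical)."""
    pilot_users: int
    pilot_days: int
    total_requests: int     # gen AI requests made by the cohort during the pilot
    total_cost_usd: float   # what those requests cost in total


def extrapolate_monthly_cost(pilot: PilotUsage, target_users: int,
                             days_in_month: int = 30) -> float:
    """Scale the observed per-user daily cost to a larger user population."""
    cost_per_user_per_day = pilot.total_cost_usd / (pilot.pilot_users * pilot.pilot_days)
    return cost_per_user_per_day * target_users * days_in_month


# Hypothetical pilot: 500 users over 14 days.
pilot = PilotUsage(pilot_users=500, pilot_days=14,
                   total_requests=42_000, total_cost_usd=630.0)

cost_per_request = pilot.total_cost_usd / pilot.total_requests
estimate = extrapolate_monthly_cost(pilot, target_users=100_000)

print(f"Cost per request: ${cost_per_request:.3f}")
print(f"Estimated monthly cost at full rollout: ${estimate:,.2f}")
```

With these made-up inputs, each request costs pennies, yet the monthly total at full rollout is far larger, which is the point about small costs adding up.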

Initially, you may see cost savings because the speed of prototyping is dramatically and almost deceptively fast. Training and testing a classifier to understand intent typically takes one to two months, but his team was able to get prototypes of what they wanted to deliver in just a couple of days. “In a week, you can get something that looks like a finished product,” says Bottaro. “We managed to build something that looks very close to what you see today in the premium experience in a month or two.”

But getting from something that’s 80% of what you want to the level of quality you need to deploy will often take much longer. In this case, another four months.

It’s still too early to learn lessons from either technical or cost control failures in gen AI pilots, CCS Insight’s Rotibi says, but users can consider quotas and rate-limiting outbound requests to cloud AI services through API management gateways, just like other cloud services. The majority of organizations plan to limit the use of gen AI to targeted roles, individuals, or teams because of the pricing. “That’s a lot of money if you want to go across the organization,” she says.
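In practice, quotas and rate limits like these are usually configured in an API management gateway rather than written by hand. Purely to illustrate the idea, here is a minimal in-process sketch that combines a token-bucket rate limit with a monthly request quota. The class names, limits, and the injected fake clock are all assumptions made for the example, not any vendor's API.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allows short bursts up to `capacity`,
    then refills at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class QuotaGate:
    """Applies a monthly request quota on top of a rate limit."""

    def __init__(self, bucket: TokenBucket, monthly_quota: int):
        self.bucket = bucket
        self.monthly_quota = monthly_quota
        self.used = 0

    def try_request(self) -> bool:
        if self.used >= self.monthly_quota:
            return False          # quota exhausted for this period
        if not self.bucket.allow():
            return False          # too many requests right now
        self.used += 1
        return True


# Demo with a fake clock so the behavior is deterministic.
now = [0.0]
gate = QuotaGate(TokenBucket(rate_per_sec=2, capacity=3, clock=lambda: now[0]),
                 monthly_quota=4)

results = [gate.try_request() for _ in range(4)]  # burst of 4 at t=0
now[0] = 1.0                                      # one second later
results += [gate.try_request() for _ in range(2)]
print(results)
```

In the demo, the fourth request is refused by the rate limit and the last one by the quota, so the caller would queue, retry later, or fall back instead of sending the request to the paid service.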

What are you measuring?

Self-reported productivity isn’t necessarily the best way to measure gen AI deployment success, and successful deployments may even change what metrics matter, Gownder says. “If you’re pushing your entire tier-one support off to generative AI and you have really good natural language, the success rate will go up, so everything that gets to a human is a harder problem,” he says. “It’s more long tail and white-glove hand holding, and the metric is more about customer satisfaction than the length of the call.”

Just measuring the quality and accuracy of gen AI results is difficult given that it’s non-deterministic; the same inputs will likely give you a different result every time. That’s not necessarily a flaw if they’re correct and consistent, but does make it harder to evaluate, so unless you have an existing tool to compare it to, you have to create a benchmark for evaluating performance.

“Defining whether something is right or wrong becomes very subjective and difficult to measure,” Bottaro says.

To evaluate the tool, Bottaro’s team created shared guidelines for what a good response looks like. Similarly, for the Ask Learn API powering Copilot for Azure, Microsoft built a ‘golden dataset’ of representative, annotated questions and answers with reference data for ground truth to test against, along with metrics to represent answer quality.
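A minimal sketch of what testing against a golden dataset can look like follows. It is not how LinkedIn or Microsoft score answers: the `GoldenExample` structure, the simple fact-coverage metric, and the stubbed `fake_model` are all assumptions made for illustration. Because output can differ between runs, each question is asked several times and the scores are averaged.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    """One annotated question with the facts a good answer must contain."""
    question: str
    reference_answer: str
    required_facts: list[str]


def fact_coverage(answer: str, example: GoldenExample) -> float:
    """Fraction of required facts found in the answer (case-insensitive)."""
    if not example.required_facts:
        return 1.0
    answer_lc = answer.lower()
    hits = sum(1 for fact in example.required_facts if fact.lower() in answer_lc)
    return hits / len(example.required_facts)


def evaluate(generate: Callable[[str], str], dataset: list[GoldenExample],
             runs: int = 3) -> float:
    """Average fact coverage over the dataset, sampling each question `runs` times."""
    per_example = []
    for example in dataset:
        run_scores = [fact_coverage(generate(example.question), example)
                      for _ in range(runs)]
        per_example.append(sum(run_scores) / runs)
    return sum(per_example) / len(per_example)


# Hypothetical golden dataset and a stub standing in for a real model call.
dataset = [
    GoldenExample("What is the refund policy?",
                  "Refunds are accepted within 30 days with a receipt.",
                  ["30 days", "receipt"]),
    GoldenExample("How do I remove all resources at once?",
                  "Delete the resource group that contains them.",
                  ["resource group"]),
]

canned = {
    "What is the refund policy?": "Refunds are accepted within 30 days.",
    "How do I remove all resources at once?": "Delete the resource group to remove everything in it.",
}


def fake_model(question: str) -> str:
    return canned[question]


score = evaluate(fake_model, dataset)
print(f"Mean fact coverage: {score:.2f}")
```

Substring matching is deliberately crude. A real benchmark would more likely use human raters following shared guidelines or a more robust similarity measure, but the structure of annotated questions, reference data, and a repeatable metric is the same.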

Organizations are often more interested in whether they make money than save it by deploying gen AI, notes Rotibi. “I can see this as a productivity capability and an efficiency improvement for my workforce,” she says. “But where am I going to make money as an organization?”

There’s pressure to demonstrate true ROI, Gownder adds, but he warns we’re not at that point yet. It may be easier to connect role-specific tools like Copilot for Sales to improvements in conversion rate, deal flow, or the mean time to resolution of a call, but he cautions against assuming a direct causal relationship when there are so many variables.

Less quantifiable benefits can still be valuable in terms of TCO, though. “Let’s say giving people Copilot not only saves them time, but takes tedious tasks off their plates,” says Gownder. “That could improve their employee experience. We know employee experience benefits tend to lower attrition, and make people more motivated and engaged. There’s a lot of positive productivity from the psychological side of that.”

But sheer enthusiasm for gen AI and LLMs complicates things, says Bottaro: “We’re faced with a problem of, ‘Let’s find out how to measure value because I definitely want to build it.’ That’s looking at it the wrong way round.” He suggests going back to the same objective function of success metrics you’d use for any product, and being open to the possibility that for some use cases, traditional AI will be good enough.

Is gen AI failing?

There are valid questions about where it’s appropriate to adopt gen AI, how to stop users accepting inaccurate answers as irrefutable truths, and the concerning inclusion of both copyrighted and inappropriate material in training sets. But negative publicity and scaremongering can exaggerate risks and ignore the useful things you can already do if you adopt gen AI responsibly.

Reported gen AI failures are often as much about irresponsible behavior by users testing boundaries, or the failure of organizations launching AI-powered tools to put sufficient guardrails in place, as they are about the inherent issues of the models themselves. Embarrassingly, at one point in 2023, OpenAI’s own $175 million VC fund was under the control of a fake identity, but that appears to be just another example of someone using AI-powered tools to help them with good old-fashioned business fraud.

Other concerns about gen AI involve deepfakes or simpler digital forgeries, potential legal risks around copyright of data used for the training set, and questions about compliance when using gen AI with sensitive or confidential data.

As with any cloud model, the notion of shared responsibility is key. AI providers need to supply models and services that are safe to use, but organizations adopting AI services must read the model cards and transparency notes, and test that they’re adequately constraining the way those services can be used.

“Some organizations have overextended to the customer with chatbots and realize they’re getting inconsistent answers,” Gownder says. But that doesn’t usually mean abandoning the project. “Maybe they pull it back and try to iterate offline before they launch it to customers,” he adds.

Organizational maturity in gen AI tends to track maturity in AI generally, and most companies adopting it say it’s helping them invest elsewhere. “They’re investing more in predictive AI, computer vision, and machine learning,” says Gownder. Businesses building their own AI tools are using multiple technologies and treating gen AI as a component rather than a solution.

The best correction to gen AI hype is to view it as both a groundbreaking technology and just another tool in the toolbox, says Bottaro.