Companies want AI to be better than the average human. Measuring that isn’t straightforward


Hello and welcome to Eye on AI…In this edition…Meta snags a top AI researcher from Apple…an energy executive warns that AI data centers could destabilize electrical grids…and AI companies go art hunting.

Last week, I promised to bring you additional insights from the “Future of Professionals” roundtable I attended at Oxford University’s Saïd Business School. One of the most interesting discussions was about the performance criteria companies use when deciding whether to deploy AI.

The majority of companies use existing human performance as the benchmark by which AI is judged. But beyond that, decisions get complicated and nuanced.

Simon Robinson, executive editor at the news agency Reuters, which has begun using AI in a variety of ways in its newsroom, said that his company had committed to not deploying any AI tool in the production of news unless its average error rate was lower than that of humans doing the same task. So, for example, the company has now begun to deploy AI to automatically translate news stories into foreign languages because, on average, AI software can now do this with fewer errors than human translators.

This is the standard most companies use—better than humans on average. But in many cases, this might not be appropriate. Utham Ali, the global responsible AI officer at BP, said that the oil giant wanted to see if a large language model (LLM) could act as a decision-support system, advising its human safety and reliability engineers. One experiment it conducted was to see if the LLM could pass the safety engineering exam that BP requires all its safety engineers to take. The LLM—Ali didn’t say which AI model it was—did well, scoring 92%, which is well above the pass mark and better than the average grade for humans taking the test.

Is better than humans on average actually better than humans?

But, Ali said, the 8% of questions the AI system missed gave the BP team pause. How often would humans have missed those particular questions? And why did the AI system get those questions wrong? The fact that BP’s experts had no way of knowing why the LLM missed the questions meant that the team didn’t have confidence in deploying it—especially in an area where the consequences of mistakes can be catastrophic.

The concerns BP had will apply to many other AI uses. Take AI that reads medical scans. While these systems are often assessed using average performance compared to human radiologists, overall error rates may not tell us what we need to know. For instance, we wouldn’t want to deploy AI that was on average better than a human doctor at detecting anomalies, but was also more likely to miss the most aggressive cancers. In many cases, it is performance on a subset of the most consequential decisions that matters more than average performance.
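To make this concrete, here is a minimal sketch with entirely hypothetical numbers (not from the article) showing how a system can beat humans on average error rate while being worse on the critical subset of cases:

```python
# Hypothetical scan-reading results: (is_critical, human_correct, ai_correct)
cases = (
    [(False, True, True)] * 850    # routine cases both get right
    + [(False, False, True)] * 100 # routine cases only the AI gets right
    + [(True, True, True)] * 30    # critical cases both get right
    + [(True, True, False)] * 15   # critical cases only the human gets right
    + [(True, False, False)] * 5   # critical cases both miss
)

def error_rate(cases, who, critical_only=False):
    """Fraction of cases the given reader gets wrong."""
    pool = [c for c in cases if c[0]] if critical_only else cases
    idx = 1 if who == "human" else 2
    return sum(1 for c in pool if not c[idx]) / len(pool)

print(f"overall:  human {error_rate(cases, 'human'):.1%}, "
      f"ai {error_rate(cases, 'ai'):.1%}")
print(f"critical: human {error_rate(cases, 'human', True):.1%}, "
      f"ai {error_rate(cases, 'ai', True):.1%}")
```

With these made-up figures, the AI errs on 2.0% of all cases versus the human’s 10.5%, yet on the critical subset it errs 40% of the time versus the human’s 10%. An average benchmark alone would pick the wrong reader for the cases that matter most.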

This is one of the toughest issues around AI deployment, particularly in higher-risk domains. We all want these systems to be superhuman in decision making and human-like in the way they make decisions. But with our current methods for building AI, it is difficult to achieve both simultaneously. While there are lots of analogies out there about how people should treat AI—intern, junior employee, trusted colleague, mentor—I think the best one might be alien. AI is a bit like the Coneheads from that old Saturday Night Live sketch—it is smart, brilliant even, at some things, including passing itself off as human, but it doesn’t understand things like a human would and does not “think” the way we do.

A recent research paper hammers home this point. It found that the mathematical abilities of AI reasoning models—which use a step-by-step “chain of thought” to work out an answer—can be seriously degraded by appending a seemingly innocuous, irrelevant phrase, such as “interesting fact: cats sleep for most of their lives,” to the math problem. Doing so more than doubles the chance that the model will get the answer wrong. Why? No one knows for sure.

Can we get comfortable with AI’s alien nature? Should we?

We have to decide how comfortable we are with AI’s alien nature. The answer depends a lot on the domain where AI is being deployed. Take self-driving cars. Self-driving technology has already advanced to the point where its widespread deployment would likely result in far fewer road accidents, on average, than having an equal number of human drivers on the road. But the mistakes that self-driving cars make are alien ones—veering suddenly into oncoming traffic, or ploughing directly into the side of a truck because their sensors couldn’t differentiate the white side of the truck from the cloudy sky beyond it.

If, as a society, we care about saving lives above all else, then it might make sense to allow widespread deployment of autonomous vehicles immediately, despite these seemingly bizarre accidents. But our unease about doing so tells us something about ourselves. We prize something beyond just saving lives: we value the illusion of control, predictability, and perfectibility. We are deeply uncomfortable with a system in which some people might be killed for reasons we cannot explain or control—essentially randomly—even if the total number of deaths dropped from current levels. We are uncomfortable with enshrining unpredictability in a technological system. We prefer to rely on humans, whom we know to be deeply fallible but believe to be perfectible if we apply the right policies, over a technology that may be less fallible but that we do not know how to improve.

With that, here’s more AI news.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

Before we get to the news, the U.S. paperback edition of my book, Mastering AI: A Survival Guide to Our Superpowered Future, is out today from Simon & Schuster. Consider picking up a copy for your bookshelf.

Also, want to know more about how to use AI to transform your business? Interested in what AI will mean for the fate of companies and countries? Then join me at the Ritz-Carlton, Millenia in Singapore on July 22 and 23 for Fortune Brainstorm AI Singapore. This year’s theme is The Age of Intelligence. We will be joined by leading executives from DBS Bank, Walmart, OpenAI, Arm, Qualcomm, Standard Chartered, Temasek, and our founding partner Accenture, plus many others, along with key government ministers from Singapore and the region, top academics, investors, and analysts. We will dive deep into the latest on AI agents, examine the data center build-out in Asia, explore how to create AI systems that produce business value, and talk about how to ensure AI is deployed responsibly and safely. You can apply to attend here and, as loyal Eye on AI readers, I’m able to offer complimentary tickets to the event. Just use the discount code BAI100JeremyK when you check out.

Note: The essay above was written and edited by Fortune staff. The news items below were selected by the newsletter author, created using AI, and then edited and fact-checked.

This story was originally featured on Fortune.com
