Can tech firms come to appreciate more affordable AI models? 

The surge in AI has been founded on a fundamental premise: Larger models equate to greater power, and the most potent models prevail. Now, the sector is poised to discover what unfolds if that premise begins to falter.

Rising costs have already pushed users to reconsider smaller, cheaper models. This hunt for economical models is new, and its effects on the industry remain uncertain, but they are expected to be considerable.

One forecast, articulated best by Coinbase co-founder Brian Armstrong, is that a large share of tasks will move to cheaper models.

“[D]emand for intelligence is nearly limitless, but 80% of tasks will utilize 99% cheaper models within 12-18 months,” Armstrong shared on X. “20% of tasks will continue to utilize the latest generation models where maximizing IQ is crucial.”

It’s hard to overstate how large a shift this would be for the AI industry if Armstrong’s forecast comes true.
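To get a rough sense of scale, the arithmetic behind Armstrong’s numbers can be sketched in a few lines. This is only an illustration under a strong assumption that does not come from his post: every task costs the same on a frontier model, so shares of tasks translate directly into shares of spend. The function name and parameters are hypothetical.

```python
def blended_cost_fraction(cheap_share: float, cheap_discount: float) -> float:
    """Total spend as a fraction of an all-frontier baseline.

    Assumes (for illustration only) that every task costs the same
    amount when run on a frontier model.
    """
    if not (0.0 <= cheap_share <= 1.0 and 0.0 <= cheap_discount <= 1.0):
        raise ValueError("shares and discounts must be between 0 and 1")
    frontier_share = 1.0 - cheap_share
    # Frontier tasks keep full price; cheap tasks pay (1 - discount).
    return frontier_share + cheap_share * (1.0 - cheap_discount)


# Armstrong's figures: 80% of tasks on models that are 99% cheaper.
fraction = blended_cost_fraction(cheap_share=0.80, cheap_discount=0.99)
print(f"Spend relative to all-frontier baseline: {fraction:.3f}")
```

Under that assumption, nearly all remaining spend comes from the 20% of tasks that stay on frontier models. Real workloads vary widely in cost per task, so the actual figure could differ substantially.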

Until now, most AI firms have competed on quality, which has typically meant reaching for the most advanced model available. If those same tasks can be handled by cheaper models without compromising quality, it would mean a dramatic shift in the economics of AI. Importantly, much of the savings would come at the expense of the major labs, dealing a financial blow to OpenAI and Anthropic just as they approach their IPOs.

This represents a potentially groundbreaking shift in the industry, hinging on a fundamental question: Are companies prepared to transition to smaller models?

Early experiments suggest that, when configured properly, cheaper models can stand in without compromising quality. In a recent test, the legal AI platform Harvey cut its inference costs by a factor of three without sacrificing quality. The test, run with the inference provider Fireworks AI, combined Claude Opus with Fireworks’ GLM 5.1, switching to Opus for the most demanding tasks. The result was markedly less server time and lower overall cost.
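The general idea of sending easy tasks to a cheap model and escalating hard ones can be sketched as follows. This is not Harvey’s or Fireworks’ actual method, which the article does not describe in detail. The difficulty scores, the threshold, the model names, and the per-task costs are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_task: float  # illustrative units, not real prices


@dataclass(frozen=True)
class Task:
    text: str
    difficulty: float  # 0.0 (easy) to 1.0 (hard); assumed to come from an upstream scorer


def pick_model(task: Task, small: Model, large: Model, threshold: float = 0.7) -> Model:
    """Send a task to the large model only if it is scored as hard."""
    return large if task.difficulty >= threshold else small


small = Model(name="small", cost_per_task=1.0)
large = Model(name="large", cost_per_task=10.0)

tasks = [
    Task("Extract the parties and dates from this contract", 0.1),
    Task("Summarize this NDA clause", 0.2),
    Task("Assess liability across conflicting jurisdictions", 0.9),
]

choices = [pick_model(t, small, large) for t in tasks]
routed_cost = sum(m.cost_per_task for m in choices)
baseline_cost = large.cost_per_task * len(tasks)

for task, model in zip(tasks, choices):
    print(f"{model.name:>5}: {task.text}")
print(f"Routed cost: {routed_cost}, all-large baseline: {baseline_cost}")
```

The hard part in practice is the difficulty score itself: a router that misjudges a demanding task saves money at the cost of quality, which is why the threshold here is left as a tunable parameter.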

“Quality is paramount, and in legal, it always will be,” Harvey co-founder Gabe Pereyra conveyed to TechCrunch, when discussing the AI legal services his firm offers. “Nonetheless, the notion of quality is changing from merely utilizing the most powerful model for all scenarios, to employing the optimal model that delivers the correct answer most efficiently.”

This trend is frequently framed in terms of major labs versus Chinese models or open-weight alternatives, but this overlooks the broader issue. The genuine divide isn’t between proprietary and open models; it’s between large models and smaller ones. You can achieve cost reductions by shifting from GPT-5.5 to DeepSeek’s V4 Flash, but transitioning to GPT-5.4-mini is equally effective.

There is an ongoing price war between the major labs’ in-house inference and independently hosted open-weight models. But for the overarching question of small versus large, it doesn’t much matter which kind of smaller model wins.

All of this may appear evident — clearly, one shouldn’t utilize more computation than necessary — but it contradicts the scaling-first mentality that has prevailed in the industry until now. Motivated by the harsh realities, labs have aggressively focused on training the most computation-heavy models achievable, pushing the limits of what AI models can accomplish. With prices substantially subsidized by investors, clients had no incentive to select anything other than the most advanced option.

As token prices increase and subsidies begin to wane, users are encountering cost pressures for the first time. It remains uncertain whether this newfound cost pressure will genuinely drive enterprise users toward smaller models. They might equally economize by making fewer requests, utilizing less context, or simply abandoning the least viable deployments.

However, if it turns out that most deployments can run well on a smaller model, it could significantly dampen the rising demand for inference, and spark new discussions about how to justify the cost of training a cutting-edge model.
