The Real AI Bill Isn't the Model — It's Running It

The demo is always cheap. That's the trap, and it catches good teams every quarter.

Here's how it goes. Someone wires a capable model into a workflow, runs it live in a meeting, and it answers ten hard questions in a row. The room nods. The feature gets a green light and a launch date. What nobody prices in that room is the one number that actually decides the feature's fate: what it costs to answer the ten-thousandth question, and the ten-millionth — every day, forever. That number is the running cost, the price of running the model on each request, and unlike a training run or a license fee, it doesn't happen once. It happens every single time someone uses the thing you built.

For thirty-five years I've watched this same shape repeat across every infrastructure wave I've lived through — client-server, virtualization, the public-cloud migration, and now AI. The cost that kills a project is almost never the one on the slide. It's the one that scales quietly with success, shows up two quarters later, and lands on a finance leader's desk with no name attached to it. AI is just the newest, fastest-moving version of that old lesson, and a lot of 2026 roadmaps are about to learn it the expensive way.

So let's take the running cost seriously — where it comes from, how to size it before you roll it out, and how to decide, workload by workload, what belongs on a commercial AI model in the cloud and what belongs on hardware you control.

Why running the model is the line item that grows

Training a model is a capital event: big, visible, mostly one-time, and — for the overwhelming majority of companies — someone else's problem. You are not training a commercial AI model. You are using one. Your cost is the running cost, and it behaves like an operating expense, because that's exactly what it is. It scales with usage, not with how impressive the pilot looked. The better your feature does, the more it costs to run. That is the opposite of the software economics most leaders grew up on, where serving the millionth user is nearly free.

Three forces make that operating cost bite harder than teams expect.

The hardware is scarce and rented at a premium. The specialized chips that run large models — GPUs — are in heavy demand, and most companies reach them by the hour through a cloud provider. You are renting some of the most expensive computing on the market, priced for scarcity, and the meter never stops. When demand for those chips outruns supply, you don't get a discount for loyalty; you get a queue and a rate card.

The cost tracks work, not seats. Traditional business software is usually priced per user, so your bill is predictable: headcount times a number. Running a model is priced per unit of work: all the text that goes in and all the text that comes out, measured in tokens (chunks of text that are often a bit shorter than a word). One person who runs long, document-heavy prompts all day can cost more than a hundred colleagues who ask a quick question now and then. Your bill follows behavior you don't fully control, and behavior is much harder to forecast than headcount.

Adoption is the thing you're actively trying to cause. The entire point of launching an AI feature is to get everyone using it. That means success and cost are the same curve, climbing together. It's why “it was cheap in the pilot” proves almost nothing — the pilot was cheap because barely anyone was using it. Roll the same feature out to the whole company and you haven't just multiplied the value; you've multiplied the meter.

Put a real number on it before you roll it out

Abstract warnings don't change decisions. Numbers do. So here is the exercise I walk clients through — and the numbers below are illustrative on purpose. Swap in your own; the shape is what matters.

Picture an AI feature that summarizes every incoming customer-support ticket so agents can triage faster. Say each summary reads a ticket thread and writes a short digest — call it about 2,000 tokens of work per ticket once you count what goes in and what comes out. Now price it against a commercial AI model at a representative blended rate of roughly $15 per million tokens.

In the pilot, one team runs it on 50 tickets a day. That's about 100,000 tokens a day — a tenth of a million — or roughly $1.50 a day. Call it $45 a month. Nobody notices. Nobody should. At that volume the running cost genuinely is a rounding error, which is precisely why the pilot tells you nothing about the bill.

At full rollout, the whole support organization runs it on 5,000 tickets a day. Same feature, same per-ticket cost — but now you're processing about 10 million tokens a day. That's roughly $150 a day, about $4,500 a month, or on the order of $54,000 a year. For one feature. And a support-summary feature is not exotic reasoning; it's a routine, repetitive task being handed to the most expensive engine on the lot.

Now run the same volume through a small, capable model priced nearer $0.50 per million tokens. Ten million tokens a day becomes about $5 a day: roughly $150 a month, or about $1,800 a year. Same work, same volume, a fraction of the cost, provided the small model clears your quality bar, which for a routine summarization task it usually can. That gap (about $54,000 a year versus about $1,800, for output that does the same job) is the entire argument of this piece, sitting in a single feature. Multiply it across the dozen AI features a company launches in a year and you can see how a program that looked cheap in every individual demo becomes the line item finance circles in red.

The discipline is simple to state and rare to practice: model the run cost at full volume before you roll it out, not the demo cost. Estimate real request volume at full adoption — the whole organization, not the pilot team. Estimate the work per request honestly, including the long, ugly inputs. Multiply it out at your provider's real rate. That one number reframes the decision, and it does so while you can still change the design cheaply.
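As a minimal sketch, that multiplication fits in a few lines of Python. The function name `run_cost` and its parameters are illustrative, the 30-day month and 12-month year are simplifying assumptions, and the rates are the illustrative figures from the support-ticket example, not any provider's price list.

```python
def run_cost(requests_per_day, tokens_per_request, usd_per_million_tokens,
             days_per_month=30, months_per_year=12):
    """Return (daily, monthly, annual) run cost in USD for one feature.

    tokens_per_request should count input AND output tokens together,
    and usd_per_million_tokens is a blended rate across both.
    """
    tokens_per_day = requests_per_day * tokens_per_request
    daily = tokens_per_day / 1_000_000 * usd_per_million_tokens
    monthly = daily * days_per_month      # assumption: 30-day month
    annual = monthly * months_per_year
    return daily, monthly, annual


# Illustrative scenarios from the support-ticket example.
scenarios = {
    "pilot, commercial model": run_cost(50, 2_000, 15.00),
    "rollout, commercial model": run_cost(5_000, 2_000, 15.00),
    "rollout, small model": run_cost(5_000, 2_000, 0.50),
}

for label, (daily, monthly, annual) in scenarios.items():
    print(f"{label:<28} ${daily:,.2f}/day  ${monthly:,.2f}/month  ${annual:,.2f}/year")
```

Swap in your own request volume, tokens per request, and rate. If your provider prices input and output tokens differently, compute the blended rate from your real mix before plugging it in.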

The pattern this is pushing: rent to learn, own to run

None of this is an argument against the cloud, and it is emphatically not an argument against commercial AI models. The big commercial models in the cloud are one of the most useful tools your teams have ever been handed. The argument is narrower and more practical: stop sending every workload to the biggest, newest model by reflex, and start matching each one to the home that fits it.

What's emerging in practice is a hybrid split, and it rhymes with something a lot of us already lived through. Keep experimentation and genuinely hard reasoning in the cloud, where the newest chips live and where you want the freedom to try things without buying anything. Move the steady, predictable, high-volume work onto capacity you control — reserved instances you've committed to, a small model running on your own hardware, or an open-weight model you host — where the per-request economics can be a fraction of the rented commercial-model rate.

If that sounds familiar, it should. It's the same arc as the public-cloud story. Companies moved everything to the cloud because renting was the right call while they were still learning what they needed. Then usage got steady and predictable, the bill got large, and finance started asking which of those steady workloads could come back in-house at a lower run rate. AI is running the same play on a compressed timeline. You rent while you're learning. You own when the load is steady and you finally know your number. The teams that get burned are the ones that treat “rent everything, always” as a permanent architecture instead of a phase.

“Own,” in 2026, rarely means building a data center. More often it means a committed-use discount on cloud capacity, a right-sized open-weight model running on a modest amount of dedicated hardware, or a managed service that lets you run smaller models cheaply without operating anything yourself. The point isn't the hardware. The point is deciding, on purpose, where each workload runs — instead of letting every workload drift to the default.
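One rough way to test whether a workload is ready to move from rented to owned is a break-even calculation. The sketch below is a simplification under stated assumptions: owned capacity is modeled as a flat monthly cost plus an optional per-token cost, and the $3,000 monthly figure is made up purely for illustration. The function name and parameters are hypothetical, not taken from any tool.

```python
def break_even_tokens_per_month(owned_fixed_usd_per_month,
                                rented_usd_per_million,
                                owned_usd_per_million=0.0):
    """Monthly token volume above which owned capacity is cheaper than renting.

    Returns None when renting is never more expensive per token,
    in which case owning cannot pay for itself at any volume.
    """
    saving_per_million = rented_usd_per_million - owned_usd_per_million
    if saving_per_million <= 0:
        return None
    return owned_fixed_usd_per_month / saving_per_million * 1_000_000


# Hypothetical: $3,000/month of committed capacity vs. a $15 per million rented rate.
threshold = break_even_tokens_per_month(3_000, 15.00)

# The support-ticket rollout example: about 10 million tokens a day, 30-day month.
monthly_volume = 10_000_000 * 30

print(f"Break-even volume: {threshold:,.0f} tokens/month")
print("Owning pays" if monthly_volume > threshold else "Keep renting")
```

Run the same function against a $0.50 per million rented rate and the threshold jumps to 6 billion tokens a month. That is one reason "own" so often means a cheaper managed tier or a committed-use discount rather than hardware. The calculation also only holds for steady load; a spiky workload leaves owned capacity idle and pushes the real break-even higher.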

The three-question test

When a workload lands on my desk and the question is “a commercial AI model in the cloud, or a smaller one on our own hardware?”, I don't reach for an ideology. I run it through three questions. All three are about fit, and any one of them can settle the call.

1. How hard is the reasoning, really? Be honest about the task in front of you, not the impressive thing the model can do in general. If the work is routine — classifying, extracting, summarizing, routing, reformatting — a small model is usually more than enough, and small models run cheaply on modest hardware. Reserve the commercial AI models for genuinely open-ended reasoning, where the quality difference is real and worth paying for. When you actually audit a company's AI workloads, most of them turn out to be routine work wearing a commercial-model price tag. That's not a knock on the teams; it's just what happens when the biggest model is the easiest one to reach for.

2. How sensitive is the data? If you'd hesitate to email the input to an outside vendor, that hesitation is information. Workloads touching regulated, confidential, or contractual data are strong candidates for staying on hardware you control, where the data never leaves your walls. This isn't only a security posture — it's frequently a compliance requirement, and keeping the model next to the data is often the cleaner path to satisfying an auditor than routing sensitive records through a third party and then explaining the data flow. The cheaper run cost, in these cases, is a bonus on top of the control you needed anyway.

3. How predictable is the volume? Spiky, experimental, unpredictable work rewards renting — you pay for what you use and walk away when you're done. Steady, high-volume, always-on work rewards owning — the per-request economics flip once utilization is high and constant, because you're no longer paying a scarcity premium on every single call. The question to ask isn't “how much traffic do we have,” it's “how predictable is that traffic?” Predictability, not raw size, is what makes owning pay.

Run a real workload through those three and the answer is usually clearer than the debate around it. Most companies discover their AI features split cleanly into two piles: a small set of hard or bursty tasks that genuinely belong on a commercial AI model in the cloud, and a large tail of routine, high-volume, predictable work that is quietly overpaying to run on a rented commercial model. Sensitive data pushes a workload toward hardware you control, whichever pile it would otherwise land in. The second pile is where the money is.
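For teams that want the test written down as something checkable, here is one possible encoding in Python. Treat it as a sketch: the `Workload` fields, the `suggest_home` function, and especially the order in which the questions are checked are assumptions made for illustration. The three questions themselves don't rank against each other, and a workload that is both sensitive and genuinely hard still needs a human decision.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    hard_reasoning: bool       # Q1: is the task genuinely open-ended?
    sensitive_data: bool       # Q2: would you hesitate to send the input to a vendor?
    predictable_volume: bool   # Q3: is the traffic steady and forecastable?


def suggest_home(w: Workload) -> str:
    """Return a starting suggestion, not a verdict."""
    # Assumption: data sensitivity is checked first because it is often
    # a compliance requirement rather than a cost preference.
    if w.sensitive_data:
        return "hardware you control"
    if w.hard_reasoning:
        return "commercial model in the cloud"
    # Routine work from here on: a small model is usually enough.
    if w.predictable_volume:
        return "smaller or owned model"
    return "smaller model, rented"


examples = [
    Workload("ticket summaries", hard_reasoning=False, sensitive_data=False, predictable_volume=True),
    Workload("strategy research", hard_reasoning=True, sensitive_data=False, predictable_volume=False),
    Workload("patient record extraction", hard_reasoning=False, sensitive_data=True, predictable_volume=True),
]

for w in examples:
    print(f"{w.name:<28} -> {suggest_home(w)}")
```

The three example workloads are hypothetical. The value of writing the rule down is less the code than the argument it forces: someone has to say out loud which answer they are giving to each question.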

How This Impacts Your Organization

The principle doesn't change with company size. What changes is your set of options, the shape of your risk, and where the leverage sits. Here's how the run-cost question lands for each kind of organization.

Large Enterprises (1,000+ employees)

You already run a mix of cloud and owned infrastructure, and you have the volume to make owning hardware genuinely pay off for steady workloads. Your risk isn't capability — it's fragmentation. Dozens of teams, each reasonably defaulting to the biggest model, and no single person holding the aggregate bill until it arrives. The fix is ownership in the organizational sense before the hardware sense: put a named owner on AI run-cost the same way you already have one for cloud spend, give them visibility across teams, and set a default policy that routine, high-volume workloads run on smaller or owned models unless someone makes the case for a commercial model. You have the leverage to negotiate committed-use rates and the scale to justify dedicated capacity for your steadiest workloads. Use both — and make “which model, and why” a question your architecture reviews actually ask, with the run-cost number attached.

Mid-Size Organizations (100–999 employees)

You feel the bill fastest, because a single runaway workload is a much bigger share of your budget and you don't have a data-center team to absorb the mistake. Don't overcorrect into building a GPU room you can't staff — that trades a predictable cloud bill for an unpredictable operational one, which is the same error companies made rushing INTO the cloud, run in reverse. Your win is focus and timing: model the run cost before you roll any feature out organization-wide, identify the two or three workloads that are genuinely high-volume, and route those to the cheapest model that clears your quality bar, using managed options that let you run smaller models without operating hardware yourself. Keep a commercial AI model on tap for the hard exceptions. You don't need a platform team to do this; you need the discipline to ask the question before the rollout, not after the invoice.

Small & Growing Organizations (Under 100 employees)

Stay in the cloud. Owning hardware is the wrong move at your size, and your volume is usually low enough that commercial-model pricing is survivable while you learn what actually delivers value. Your discipline is a different one, and it's lightweight: turn on a spend alert so a runaway feature can't surprise you, watch for the single automation that quietly becomes your largest line item, and don't wire a commercial AI model into a high-frequency, always-on process without first checking what it costs at full volume. “Rent while you learn” is exactly the right posture for you — you just want to keep one eye on the meter as you grow, so that the day a feature does take off, the cost is a decision you made rather than a bill you discover.

What to do Monday morning

You don't need a task force to get ahead of this. You need one afternoon and a willingness to look at real numbers. Here's the sequence I'd run.

This week: list every AI feature you have in production or about to launch. For each one, write down the honest full-volume request count and the work per request, and multiply it out at your provider's real rate. Sort the list by annual run cost, largest first. The top of that list is where your attention belongs — and it's usually shorter than people fear.
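A spreadsheet does this job fine, but if the inventory already lives in code, the ranking is a short script. In the sketch below only the ticket-summary row reuses the illustrative figures from earlier in this piece; the other two features and all their numbers are hypothetical placeholders, and the 360-day year is an assumption that matches 12 months of 30 days.

```python
DAYS_PER_YEAR = 360  # assumption: 12 months of 30 days

# One row per AI feature, at honest FULL-rollout volume.
features = [
    {"name": "email routing", "requests_per_day": 20_000,
     "tokens_per_request": 300, "usd_per_million_tokens": 0.50},
    {"name": "ticket summaries", "requests_per_day": 5_000,
     "tokens_per_request": 2_000, "usd_per_million_tokens": 15.00},
    {"name": "contract Q&A", "requests_per_day": 200,
     "tokens_per_request": 8_000, "usd_per_million_tokens": 15.00},
]


def annual_run_cost(feature):
    """Annual run cost in USD for one feature row."""
    tokens_per_day = feature["requests_per_day"] * feature["tokens_per_request"]
    daily_usd = tokens_per_day / 1_000_000 * feature["usd_per_million_tokens"]
    return daily_usd * DAYS_PER_YEAR


# Largest annual run cost first: the top of this list gets attention first.
ranked = sorted(features, key=annual_run_cost, reverse=True)

for feature in ranked:
    print(f"{feature['name']:<20} ${annual_run_cost(feature):,.0f}/year")
```

Note that the feature with the most requests per day is not necessarily the most expensive one. Tokens per request and the rate matter just as much, which is why the sort is on cost and not on traffic.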

Next week: take the top one or two workloads and ask the three questions of each — how hard is the reasoning, how sensitive is the data, how predictable is the volume. For any workload that's routine, non-sensitive, and steady, pilot a smaller or owned model against it and compare quality head to head. You're not looking for the model to be perfect; you're looking for it to clear your bar at a fraction of the cost.

Within the month: set a default. Write down, in one page, which kinds of workloads run on a commercial AI model in the cloud and which run on something cheaper — and make that page the thing your teams check before they wire in a model. Add a spend alert so the next runaway feature announces itself early. That's the whole program: a number, a test, a default, and an alarm.
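Most cloud and model providers offer budget alerts in their billing consoles, and that is usually the right place to set the alarm. If you want to see the logic behind one, here is a minimal sketch that projects month-to-date spend forward at the current pace. The function name, the 80% warning threshold, and the 30-day month are all assumptions chosen for illustration.

```python
def spend_alert(month_to_date_usd, day_of_month, monthly_budget_usd,
                days_in_month=30, warn_fraction=0.8):
    """Classify projected month-end spend as 'ok', 'warn', or 'over'.

    The projection is a straight line from spend so far, so it will
    under-react to a feature whose usage is accelerating mid-month.
    """
    if day_of_month < 1:
        raise ValueError("day_of_month must be at least 1")
    projected = month_to_date_usd / day_of_month * days_in_month
    if projected >= monthly_budget_usd:
        return "over"
    if projected >= warn_fraction * monthly_budget_usd:
        return "warn"
    return "ok"


# Hypothetical: $1,500 spent by day 10 against a $5,000 monthly budget.
status = spend_alert(1_500, 10, 5_000)
print(f"Run-cost status: {status}")
```

In this example the projection is $4,500 for the month, which crosses the 80% warning line without breaching the budget. That is the point of the alarm: it fires while there is still time to change something.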

The teams that win the next two years won't be the ones that used the biggest model everywhere. They'll be the ones that knew, workload by workload, which model to use where — and could show the number behind every choice. Speed still matters. But a feature that gets switched off because nobody modeled its run cost never got the chance to matter at all. Price the running of it before you roll it out, and you get to keep the AI you build.

#ai #finops #infrastructure

About the author

Charles Redding

Founder of DLegendDigital. 35+ years of enterprise technology leadership across audit, risk management, cybersecurity, and AI. Former CIO, VP of Technology, and Director at organizations ranging from high-growth startups to $4.3B global enterprises.

