Inference Cost
Inference cost is the computational expense incurred each time an AI model processes an input and generates an output. For large language models and multimodal AI systems, inference cost is measured in compute units consumed per request, typically expressed in dollar cost per million input tokens and dollar cost per million output tokens. It is the primary operational cost for AI companies delivering model-based services at scale, and the most important determinant of gross margin quality in AI application businesses.
Inference cost has emerged as one of the most consequential metrics in AI company valuation and M&A due diligence. In the early period of large language model deployment (2022 to mid-2024), high inference cost was often treated as a temporary constraint that would fall as competition intensified among cloud providers. That expectation has partially materialised: per-token pricing on leading model APIs has declined significantly. However, the growth in token consumption per enterprise use case, driven by longer context windows and more complex multi-step agentic workflows, has meant that total inference cost as a percentage of revenue has not declined at the same rate for many AI application companies. Understanding inference cost dynamics is therefore not merely a technical detail; it determines whether an AI company’s gross margins are sustainable and whether the business model is structurally sound at scale.
How Inference Cost Is Measured
Inference cost is denominated in the compute consumed by each forward pass through the model; autoregressive generation requires one forward pass per output token. For commercial API purposes, this is abstracted into token pricing, where a token is approximately four characters of English text under most current tokenisation schemes.
Input tokens are the tokens in the prompt sent to the model: the user query, the system prompt, the context retrieved from a knowledge base, and any tool output injected into the conversation. For agentic workflows with long context windows, input token counts can reach hundreds of thousands of tokens per interaction.
Output tokens are the tokens the model generates in response. Output tokens are consistently priced higher than input tokens by commercial API providers, reflecting the autoregressive cost of producing each token sequentially. Output counts in a typical enterprise query range from a handful of tokens for a classification label to several thousand for a document analysis response.
Latency is a secondary cost dimension: time-to-first-token and time-to-completion affect the user experience, and the trade-off between latency and cost (achieved through batching, speculative decoding, or smaller models) is an active operational decision for AI application companies.
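As a simple illustration, the cost of a single request is a linear function of the two token counts and the two per-million-token prices. The sketch below uses placeholder figures, not any provider's actual rates:

```python
# Per-request inference cost: input and output tokens are priced separately.
# All prices here are illustrative placeholders, not quotes from any provider.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request at the given per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# An agentic request with a large retrieved context and a moderately long answer.
cost = request_cost(input_tokens=120_000, output_tokens=2_000,
                    input_price_per_m=2.50, output_price_per_m=10.00)
print(f"${cost:.4f} per request")  # -> $0.3200 per request
```

Note how the large input context dominates the cost even though output tokens are priced four times higher per token.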
A representative comparison of per-token pricing across major API providers as of 2025:
| Provider / Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-4o (OpenAI) | ~$2.50 | ~$10.00 | Highest capability tier |
| GPT-4o mini | ~$0.15 | ~$0.60 | Cost-optimised tier |
| Claude 3.5 Sonnet | ~$3.00 | ~$15.00 | Strong reasoning performance |
| Claude 3 Haiku | ~$0.25 | ~$1.25 | Fast, cost-efficient tier |
| Llama 3 70B (self-hosted) | ~$0.60–1.50 | ~$0.60–1.50 | Depends on GPU provider pricing |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 | Cost-optimised long-context |
| Qwen (Alibaba Cloud, APAC) | Lower than US equivalents | Lower than US equivalents | Competitive pricing for APAC workloads |
These figures are approximate and change frequently as providers compete on price. The more important variable for AI company financial modelling is the blended cost per unit of enterprise value delivered, not the raw per-token price, because different AI tasks have very different token consumption profiles.
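One way to model that blended cost is to weight each task type's token profile by its share of the workload. The task mix, token counts, and prices in this sketch are illustrative assumptions, not measured figures:

```python
# Blended inference cost per completed task across a workload mix.
# Task profiles, shares, and prices are illustrative assumptions.

workload = [
    # (task, share of tasks, input tokens, output tokens, $/1M in, $/1M out)
    ("classification",    0.60,   1_500,    50, 0.15,  0.60),  # cheap tier
    ("document_analysis", 0.30,  40_000, 3_000, 2.50, 10.00),  # frontier tier
    ("agentic_workflow",  0.10, 200_000, 8_000, 2.50, 10.00),  # frontier tier
]

blended = sum(
    share * ((tin / 1e6) * pin + (tout / 1e6) * pout)
    for _, share, tin, tout, pin, pout in workload
)
print(f"Blended cost per task: ${blended:.4f}")  # -> $0.0972
```

In this assumed mix, the 10% of tasks that are agentic workflows account for well over half of the blended cost, which is why per-token price alone is a poor planning variable.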
Why Inference Cost Matters for AI Company Valuation
Inference cost affects AI company valuations through three distinct mechanisms.
Gross margin compression at scale. For an AI application company selling at a fixed price per seat or per transaction, inference cost is the primary COGS component. A company generating $10M ARR from 100 enterprise customers might have a comfortable gross margin at current usage levels. If those customers double their agent deployment intensity, doubling token consumption without a corresponding increase in seat price, inference COGS doubles against flat revenue and gross margin falls point-for-point, as the sketch below illustrates. Investors and acquirers model this trajectory explicitly, and companies that cannot demonstrate a credible path to gross margin improvement as usage scales are valued at a discount to the enterprise software benchmark multiples that AI application companies otherwise command.
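A minimal margin model for the scenario above, with all dollar figures assumed purely for illustration:

```python
# Gross margin trajectory under fixed per-seat pricing as usage intensity grows.
# All dollar figures are illustrative assumptions for the $10M ARR scenario.

arr = 10_000_000                 # fixed: seat pricing does not scale with usage
other_cogs = 1_000_000           # hosting, storage, support (assumed flat)
base_inference_cost = 2_000_000  # inference COGS at current usage intensity

for usage_multiple in (1.0, 1.5, 2.0, 3.0):
    inference = base_inference_cost * usage_multiple  # tokens scale with usage
    margin = (arr - other_cogs - inference) / arr
    print(f"usage x{usage_multiple}: gross margin {margin:.0%}")
# x1.0 -> 70%, x2.0 -> 50%, x3.0 -> 30%: with inference starting at 20% of
# revenue, each doubling of usage removes 20 points of gross margin.
```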
Unit economics differentiation between AI-native and AI-enhanced companies. An AI-native company whose core product is the model output (a legal AI that generates contract analysis, a clinical AI that produces diagnosis support) has inference cost baked directly into its product delivery cost. An AI-enhanced company that has added AI features to an existing software product may have a substantially lower inference cost as a percentage of revenue because the AI feature drives engagement without being the primary value delivery mechanism. Diligence processes that compare companies in the same AI application category must account for this structural difference.
Technology dependency risk. AI application companies that depend entirely on third-party API providers for inference are exposed to provider pricing changes, model discontinuation risk, and competitive threats from providers who also build application-layer products. Companies that have invested in their own inference infrastructure, whether through self-hosted models, fine-tuned smaller models, or inference optimisation capabilities, have a structurally superior position. This infrastructure investment is a valuation-positive signal even when the immediate cost savings are modest, because it signals the capability to control the most significant cost driver in the business.
Inference Cost in M&A Due Diligence
In AI company M&A diligence, inference cost analysis covers several specific workstreams.
Historical COGS decomposition. Acquirers will request a breakdown of COGS that isolates inference cost from other hosting, storage, and support costs. For companies using third-party APIs, this decomposition is relatively straightforward from cloud billing data. For companies with proprietary infrastructure, the allocation of GPU depreciation and cloud compute to specific product lines is a more complex accounting exercise that should be prepared before diligence begins.
Usage and cost cohort analysis. The relationship between customer ARR, usage volume, and inference cost should be analysed at the cohort level, not just the aggregate. A common finding in diligence is that the top 10–20% of customers by ARR are responsible for 50–70% of inference cost, because enterprise customers deploy agents at volumes that exceed what the pricing model was designed to accommodate. Acquirers will model whether these usage patterns are representative of the broader customer base as it matures, and whether the pricing model has been adjusted to reflect them.
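The concentration analysis itself is straightforward once inference cost can be attributed per customer. The sketch below uses synthetic customer records to show the shape of the calculation:

```python
# Cohort view: share of total inference cost attributable to the top customers
# by ARR. The customer records are synthetic, for illustration only.

customers = [
    # (annual ARR, monthly inference cost) -- synthetic data
    (500_000, 45_000), (400_000, 38_000), (350_000, 9_000),
    (120_000,  4_000), (100_000,  3_500), (80_000,   2_000),
    (60_000,   1_500), (50_000,   1_200), (40_000,     900),
    (30_000,     700),
]

by_arr = sorted(customers, key=lambda c: c[0], reverse=True)
total_cost = sum(cost for _, cost in customers)
top_n = max(1, len(by_arr) // 5)  # top 20% of customers by ARR
top_share = sum(cost for _, cost in by_arr[:top_n]) / total_cost
print(f"Top 20% of customers by ARR drive {top_share:.0%} of inference cost")
```

The hard part in practice is not this arithmetic but the per-customer cost attribution itself, which requires request-level metering tied to customer identity.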
Inference cost reduction roadmap credibility. Companies that have a credible plan to reduce inference cost through model distillation, fine-tuned smaller models for specific tasks, caching of repeated queries, or migration to lower-cost foundation model tiers should present this roadmap explicitly. The credibility of the roadmap depends on the technical team’s prior execution on similar optimisations and the specific tooling investments already in place.
Foundation model dependency assessment. For companies running on OpenAI, Anthropic, or Google API, acquirers will assess the switching cost and switching risk if the preferred provider changes pricing or deprecates a model version. Companies that have already validated alternatives or migrated portions of their workload to open-weight models have reduced this risk profile.
Inference Cost Management Techniques
AI companies have developed several approaches to managing inference cost as they scale.
Model distillation. A smaller, fine-tuned model is trained to replicate the outputs of a larger, more expensive model on a specific task distribution. A company whose core product involves a repetitive structured task, such as extracting specific fields from a standardised document, can often distill a high-performing small model that costs a fraction of the frontier model API price while maintaining accuracy within acceptable bounds.
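A distillation pipeline starts by collecting teacher labels on the target task distribution. The sketch below assumes the official OpenAI Python client; the model name, prompt, and `document_corpus` are placeholders, and the collected pairs feed a separate fine-tuning job for the student model rather than being used directly:

```python
# Collecting (input, teacher output) pairs for distilling a field-extraction
# task. Assumes the official OpenAI Python client; the model name, prompt,
# and document_corpus are placeholders.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def teacher_label(document: str) -> str:
    """Ask the frontier (teacher) model for the structured extraction."""
    response = client.chat.completions.create(
        model="gpt-4o",  # teacher: expensive, high accuracy
        messages=[
            {"role": "system",
             "content": "Extract invoice_number, total, and due_date as JSON."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

# training_pairs = [(doc, teacher_label(doc)) for doc in document_corpus]
# These pairs become the supervised fine-tuning set for a small student model,
# which then serves the task at a fraction of the per-token price.
```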
Response caching. Semantically similar queries can often return cached responses without a new inference call. For AI products with predictable query patterns, such as FAQ-style customer service agents or standardised document analysis workflows, caching can reduce inference calls by 20–40% with no perceptible quality degradation.
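A minimal semantic cache can be sketched as an embedding lookup with a similarity threshold. Here `embed` is a stand-in for any sentence-embedding model, and the 0.95 threshold is an assumed starting point that must be tuned per workload:

```python
# A minimal semantic cache: return a stored response when a new query embeds
# close enough to a previously answered one. `embed` is a placeholder for any
# sentence-embedding model; the threshold is an illustrative assumption.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> str | None:
        q = self.embed(query)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response     # cache hit: no inference call made
        return None                 # cache miss: caller runs inference

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

The main design risk is a false hit: a query similar in wording but different in intent. That is why the threshold should be validated against logged production queries before the cache is trusted.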
Prompt engineering and context reduction. Reducing the number of input tokens without reducing the information required for accurate responses is a significant lever. Techniques include more precise retrieval-augmented generation (RAG) that passes only the relevant document passages rather than entire documents, compression of conversation history, and structured prompt templates that eliminate redundant context. See Context Window for background on how context window size interacts with inference cost.
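In code, the lever is simply ranking candidate passages and letting only the top k into the prompt. The `similarity` scorer, the template, and the value of k below are illustrative assumptions:

```python
# Context reduction: pass only the top-k most relevant passages to the model
# instead of entire documents. `similarity` stands in for any retrieval scorer
# (BM25, embedding cosine); k and the prompt template are assumptions.

def build_prompt(query: str, passages: list[str], similarity, k: int = 3) -> str:
    ranked = sorted(passages, key=lambda p: similarity(query, p), reverse=True)
    context = "\n\n".join(ranked[:k])  # only the k best passages enter the prompt
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

If the average document is tens of thousands of tokens and the relevant passages are a few hundred, this single change can cut input token cost by an order of magnitude.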
Routing between model tiers. Some AI companies implement a routing layer that classifies queries by complexity and routes simpler queries to cheaper, faster models while reserving frontier models for tasks requiring maximum capability. A well-designed routing layer can reduce average inference cost by 30–50% with minimal impact on overall product quality.
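A routing layer can be sketched as a cheap pre-classification step. The heuristic below is deliberately crude and purely illustrative; production routers typically use a small trained classifier and measured cost data:

```python
# A minimal routing layer: a cheap check decides whether a query needs the
# frontier model. The tier names and the heuristic are illustrative assumptions.

def needs_frontier_model(query: str) -> bool:
    """Placeholder complexity check; real routers usually use a small
    classifier model trained on labelled query outcomes."""
    long_query = len(query.split()) > 200
    reasoning_markers = any(w in query.lower()
                            for w in ("analyse", "compare", "draft", "negotiate"))
    return long_query or reasoning_markers

def route(query: str) -> str:
    return "frontier-model" if needs_frontier_model(query) else "cheap-model"

print(route("What is the invoice total?"))            # -> cheap-model
print(route("Compare these two indemnity clauses."))  # -> frontier-model
```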
Inference Cost in APAC: Regional Considerations
For AI companies operating in Asia Pacific, inference cost has several region-specific dimensions.
APAC cloud provider pricing. Alibaba Cloud, Baidu Cloud, Tencent Cloud, and Naver Cloud all offer foundation model inference at pricing structures that are competitive with US providers for APAC workloads and may be significantly cheaper for Chinese-language and Korean-language tasks, where these providers have invested more in model optimisation. For AI companies serving primarily APAC enterprise customers, the assumption that AWS or GCP inference pricing is the only relevant benchmark is incorrect.
Multilingual token efficiency. The token count for a given piece of information varies across languages. CJK scripts (Chinese, Japanese, Korean) often require more tokens per character in models trained primarily on English corpora, which means that the same information content costs more to process in Japanese or Korean than in English when using models not optimised for those languages. Models fine-tuned for CJK languages, including Upstage Solar, Zhipu GLM, and Moonshot Kimi, tokenise these languages more efficiently, which translates directly into lower per-query inference cost for APAC enterprise workloads.
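The effect is easy to observe directly with a tokeniser. The sketch below uses the open-source tiktoken library and its cl100k_base encoding; the sentences are illustrative and exact counts vary by tokeniser version:

```python
# Comparing token counts for roughly equivalent English and Japanese text under
# an English-centric tokeniser. Requires `pip install tiktoken`.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Please summarise the attached contract and flag any unusual clauses."
japanese = "添付の契約書を要約し、異例な条項があれば指摘してください。"

for label, text in (("English", english), ("Japanese", japanese)):
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")
# The Japanese sentence produces more tokens per character, so the same request
# costs more at identical per-token prices.
```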
Data residency requirements. In Japan, South Korea, Australia, and increasingly across Southeast Asia, regulated enterprises face data residency requirements that prevent them from sending sensitive document data to US-hosted inference endpoints. AI companies serving regulated APAC verticals must either maintain APAC-region inference infrastructure or use APAC-compliant cloud regions, which affects the cost structure and the set of providers available.
Three Due Diligence Questions for AI Acquirers
For acquirers evaluating AI application or AI infrastructure companies, inference cost deserves explicit investigation.
What is the fully loaded gross margin after all inference costs, and how does it trend as the largest customers increase usage? A company reporting a 70% gross margin that excludes inference cost has a meaningfully different risk profile from one reporting the same margin with inference cost allocated correctly. The margin trajectory as enterprise customers expand usage is more important than the point-in-time margin.
What is the inference cost exposure to any single model provider, and what would happen to the business if that provider raised prices by 50%? API pricing for frontier models has declined, but the risk of future price increases, model deprecation, or competitive conflict with the model provider’s own application products is real. Companies that have demonstrated migration capability or have diversified across providers have reduced this concentration risk.
What is the company’s roadmap for inference cost reduction, and what specific investments has it already made? Fine-tuned sub-models, caching infrastructure, routing layers, and self-hosting capabilities are specific evidence of technical execution on cost management. Founders who can demonstrate that they have already reduced inference cost per query by a measurable percentage through specific technical investments are materially more credible than those offering only a plan.
Inference cost is among the most frequently misunderstood financial metrics in AI company evaluation. Understanding it requires both technical context and financial modelling discipline. Amafi Advisory works with AI company founders and corporate acquirers to structure transactions where AI infrastructure economics are central to valuation — including diligence review of inference cost structure, gross margin sustainability, and technology dependency risk. Contact our team to discuss an advisory engagement.
Related terms: Context Window, Synthetic Data, Red-Teaming, Due Diligence, ARR