
Embeddings

Embeddings are dense numerical vector representations of text, images, code, or other discrete data produced by a trained AI model, in which objects with similar meaning or properties are mapped to nearby positions in a high-dimensional mathematical space. They are the foundational computational structure underlying semantic search, retrieval-augmented generation, recommendation systems, and most enterprise AI applications built since 2022.

Embeddings are not outputs in the way that a language model’s response is an output. They are intermediate representations, computed inside a model and extracted for use elsewhere. When a user searches a document repository, the query is converted to an embedding vector, each document is stored as an embedding vector, and the system retrieves documents whose vectors are closest to the query vector in embedding space. When a company builds a recommendation engine, every user and every product is an embedding, and the recommendation logic operates on the geometric relationships between those vectors. This indirect, geometry-based computation is what makes embeddings powerful: operations that would require hand-written rules in traditional software become learnable patterns in embedding space.
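The geometry-based retrieval described above can be sketched in a few lines. The vectors and document names below are toy stand-ins for real model output (production embeddings have hundreds or thousands of dimensions), but the nearest-vector logic is the same:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings standing in for stored document vectors.
documents = {
    "invoice_policy": [0.9, 0.1, 0.0],
    "holiday_schedule": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of the query "how do I submit an invoice?".
query = [0.8, 0.2, 0.1]

# Retrieve the document whose vector is closest to the query vector.
best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best)  # invoice_policy
```

No hand-written rule maps "submit an invoice" to the invoice policy document; the proximity of the two vectors carries that relationship.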

For AI company M&A due diligence, embeddings matter for two reasons. First, the quality of an AI product’s embedding layer is often the primary determinant of how well it works, which makes it a central subject in technical due diligence. Second, the provenance and defensibility of a company’s embeddings are the most reliable signal of whether its AI represents a genuine data moat or a thin wrapper over a commoditised foundation model API.


How Embeddings Are Produced

An embedding model takes an input object and outputs a fixed-length vector of floating-point numbers. The dimensionality of the vector varies by model: 384 dimensions is common for lightweight models, 1,536 dimensions for OpenAI’s text-embedding-ada-002, and 4,096 dimensions for larger proprietary models. Higher dimensionality generally allows more nuanced semantic representation but requires more storage and computation for the downstream similarity search.
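The storage cost of dimensionality follows directly: a flat, uncompressed float32 index needs dimensions × 4 bytes per vector. A rough sizing sketch for the three dimensionalities above, at an illustrative corpus size of 10 million chunks:

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw storage for a flat (uncompressed) float32 vector index."""
    return num_vectors * dims * bytes_per_float

# Storage for 10 million stored vectors at three common dimensionalities.
for dims in (384, 1536, 4096):
    gib = index_size_bytes(10_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:6.1f} GiB")
```

Compression and approximate-nearest-neighbour index structures change the constant factors, but the linear relationship between dimensionality and storage (and similarity-search compute) holds.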

The training process determines what geometric relationships the embedding space captures. A model trained to predict whether two sentences mean the same thing will produce an embedding space where semantically similar sentences cluster together regardless of surface-level wording differences. A model trained on code will produce an embedding space where functionally equivalent code snippets cluster together regardless of programming language or variable naming.

Three categories of embedding model matter commercially in 2025 and 2026:

General-purpose text embedding models. OpenAI’s text-embedding-3-large, Cohere Embed v3, Google’s text-embedding-004, and Voyage AI’s embedding models are the most widely used. They are available via API and are trained on large English-first multilingual corpora. These models are the standard starting point for new AI applications.

Specialized domain embedding models. Models trained on domain-specific corpora — legal contracts, medical literature, financial statements, code repositories, scientific papers — produce embedding spaces that are meaningfully better for domain-specific retrieval than general-purpose models. A legal AI company that has trained its own embedding model on millions of contracts will outperform a competitor using the OpenAI API on the specific task of finding relevant contract clauses. This performance difference is measurable, monetizable, and defensible, which is why specialised embedding models are a significant component of AI company IP.

Multimodal embedding models. Models that embed different data types into a shared vector space allow cross-modal retrieval: a text query can find relevant images, or an image can retrieve related documents. OpenAI’s CLIP-based models and Google’s multimodal embeddings are the primary general-purpose variants. Specialised multimodal models are emerging in manufacturing (visual inspection combined with maintenance text records), healthcare (imaging combined with clinical notes), and e-commerce (product images combined with attribute descriptions).


Embeddings and Data Moats in AI M&A

The distinction between general-purpose embeddings and proprietary embeddings is the most important valuation consideration in any AI company due diligence involving retrieval-based or recommendation-based AI products.

A company that indexes its content using the OpenAI embeddings API and retrieves via Pinecone, Weaviate, or a similar vector database has built a working product, but the embeddings themselves are not proprietary. Any competitor with the same underlying content could achieve approximately the same retrieval quality by using the same API. The moat is in the content, not the embeddings.

A company that has fine-tuned an embedding model on its proprietary content, user interactions, and domain-specific labels has built something more defensible. The fine-tuned model encodes not just the content but the patterns of what queries the company’s users actually ask, which concepts the company’s customers treat as related, and how the company’s specific domain maps onto semantic space. This fine-tuned embedding model is proprietary IP. The moat is in the model, the training data, and the accumulated fine-tuning signal, all of which are difficult to replicate without the same historical interaction data.

Acquirers applying standard technical due diligence in 2025 and 2026 will probe this distinction explicitly. The key questions are whether the company’s embedding model is a third-party API call or a proprietary artifact, whether the fine-tuning data is retained and owned by the company, and whether the embedding model’s performance advantage over general-purpose alternatives is measurable and documented.


Embeddings in Retrieval-Augmented Generation

The most common use of embeddings in enterprise AI products is as the retrieval layer in a RAG architecture. In a RAG system, documents are split into chunks, each chunk is converted to an embedding vector and stored in a vector database, and incoming queries are embedded and matched against the stored vectors to retrieve relevant context before a language model generates a response.
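The chunk-embed-retrieve pipeline above can be sketched end to end. The `embed` function below is a toy bag-of-words stand-in for a real embedding model (a production system would call a model or API at that point), and the document strings are invented:

```python
import math
from collections import Counter

VOCAB = ["refund", "invoice", "holiday", "policy", "deadline"]

def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: a tiny bag-of-words vector."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# 1. Chunk documents (trivially here: one chunk per string).
chunks = [
    "refund policy for damaged goods",
    "holiday schedule and office closures",
    "invoice submission deadline is the 5th",
]
# 2. Embed each chunk and store it in an in-memory "vector database".
index = [(chunk, embed(chunk)) for chunk in chunks]

# 3. Embed the query and retrieve the top-k closest chunks as LLM context.
query_vec = embed("when is the invoice deadline")
top_k = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:1]
print(top_k[0][0])  # invoice submission deadline is the 5th
```

The retrieved chunk, not the language model, determines whether the final answer is grounded in the right source material.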

The embedding layer in a RAG architecture determines how accurately the system retrieves relevant context. Poor embedding quality, either because the general-purpose embedding model does not capture domain-specific semantics or because the chunking strategy loses meaningful context, is the most common cause of RAG system failures in enterprise deployment. An AI product that retrieves irrelevant context will generate confidently wrong answers regardless of the quality of the language model it uses.

For M&A due diligence on a RAG-based AI product, the technical review should cover: which embedding model is used (API or proprietary), how retrieval accuracy is measured (recall at k, mean average precision, or similar), whether the company has a retrieval evaluation dataset, and how retrieval performance has improved over the company’s history. A company that can show a clear retrieval improvement trajectory tied to embedding model investment is demonstrating a disciplined technical program that deserves a higher valuation multiple.
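Recall at k, one of the metrics named above, is simple to compute once a labelled evaluation set exists; the document IDs and gold labels below are hypothetical:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# One query from a hypothetical evaluation set: ranked results vs. gold labels.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_8"}

score = recall_at_k(retrieved, relevant, k=5)
print(score)  # 2 of the 3 relevant documents appear in the top 5
```

A company that tracks a number like this per release, over a stable evaluation set, is the kind of company that can document the improvement trajectory diligence teams look for.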


Acquirer Due Diligence Questions on Embeddings

Buyers conducting technical due diligence on an AI company whose product relies on embeddings should expect to probe five areas:

1. Embedding model provenance. Is the embedding model a third-party API, an open-source model deployed and maintained by the company, or a model the company has trained or fine-tuned on its own data? Each position represents a different risk profile and defensibility level.

2. Fine-tuning dataset ownership. If the company has fine-tuned an embedding model, who owns the training data, and is that ownership documented? User interaction logs used for fine-tuning may have data licensing implications, particularly for APAC companies operating under PDPA (Singapore), PIPL (China), APPI (Japan), or PIPA (Korea) privacy regimes.

3. Retrieval performance benchmarking. Does the company measure retrieval quality, and can it produce a historical record of retrieval accuracy? Companies that track and improve retrieval performance systematically are more operationally mature and justify a higher multiple than companies that rely on product team intuition.

4. Vector database dependency. Which vector database does the company use — Pinecone, Weaviate, Qdrant, Chroma, pgvector, or a cloud-native offering — and what is the lock-in level? Proprietary embedding models deployed against a commodity vector database are more defensible than general-purpose embeddings against a tightly integrated proprietary vector store.

5. Inference cost per embedding generation. For products that generate embeddings at high volume (real-time document indexing, high-throughput recommendation systems), the cost per embedding generation is a direct component of COGS. Acquirers should verify whether this cost has declined as the company has scaled, which indicates active cost management, or whether it has grown proportionally with volume, which indicates a potential gross margin headwind at scale.
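A back-of-envelope sketch of that COGS line, assuming simple per-token API pricing; the volume and price figures are invented for illustration:

```python
def embedding_cogs(tokens_per_month: int, price_per_million_tokens: float) -> float:
    """Monthly embedding-generation cost under per-token API pricing."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# Hypothetical figures: 2 billion tokens embedded per month at $0.10 / 1M tokens.
monthly_cost = embedding_cogs(2_000_000_000, 0.10)
print(f"${monthly_cost:,.2f} per month")  # $200.00 per month
```

The diligence question is how this line moves as volume grows: re-embedding entire corpora on every model upgrade, for example, multiplies the token count and shows up directly in gross margin.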


Embeddings in APAC AI Products

APAC AI products face a distinct challenge with embeddings that does not apply with the same severity in US or European markets: the need for high-quality multilingual and non-English embeddings. General-purpose English embedding models trained primarily on English text perform significantly worse on Japanese, Korean, Traditional Chinese, Simplified Chinese, Thai, Vietnamese, and other APAC languages, for reasons that include both data representation in training sets and the structural differences between East Asian and European languages.

APAC-native embedding models have emerged to address this gap. BAAI’s bge-m3 model (China) provides strong multilingual coverage. Cohere Embed v3 and Voyage AI have expanded their multilingual training datasets. Japanese-specific models such as those released by ELYZA and CyberAgent have been evaluated for Japanese legal and financial text. Korean models from NAVER (Clova Embedding) cover Korean-language enterprise use cases.

For an APAC AI company whose product relies on accurate retrieval in Japanese, Korean, or Chinese, the quality of the multilingual embedding model is a key technical differentiator. A company that has invested in fine-tuning its embedding model on APAC-language domain content has a more defensible product than a competitor using a general-purpose multilingual model, particularly in verticals such as legal AI, financial AI, healthcare AI, and government AI where retrieval accuracy in the local language is a procurement requirement.

Amafi Advisory works with APAC AI companies at all stages of the M&A and fundraising lifecycle. If you are evaluating an AI company’s technical defensibility, or preparing your own AI company for a transaction, our team can advise on embedding strategy as part of the broader technical due diligence process. Get in touch.

Related terms

RAG · Foundation Model · Fine-Tuning · Inference Cost · Context Window