In modern enterprise environments, AI integrations (especially chat interfaces) are booming. However, every AI call adds latency, cost, and the risk of repetition. Imagine if your system could recognize when it has seen a question before and serve an intelligent cached response instead.

That was the idea: build a semantic cache layer for Sitecore that avoids repeated AI calls for semantically similar queries, using ONNX embeddings and a similarity threshold.

This blog walks through our first working prototype, the system design, key implementation points, and what’s next.

ONNX for Embedding Generation

A critical part of this project is turning user queries into semantic vectors: numerical representations of meaning. To do this efficiently and reliably, we selected ONNX (Open Neural Network Exchange) as our embedding model format.

Why ONNX?

Performance: ONNX models are optimized for fast inference, ideal for real-time caching systems.

Compatibility: It works seamlessly within .NET applications using libraries like Microsoft.ML.OnnxRuntime.

Offline Capability: Once exported, the ONNX model runs locally; no internet connection or external API calls are required.

This makes ONNX the perfect engine for turning a user question into a numeric fingerprint you can compare and reuse.

System Architecture Overview

At a high level, this solution does the following:

  1. Receives a user query.
  2. Generates an embedding vector using the ONNX model.
  3. Compares it against the embeddings already in the cache.
  4. If similar enough (configurable threshold), returns the cached answer.
  5. Otherwise, queries the AI service, caches the response, and returns it.
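The steps above can be sketched in C#. This is an illustrative outline, not the exact repository code: `FindBestMatch`, `StoreEntry`, and the `SysResponse` initializer are assumed helper names.

```csharp
// Cache-first flow: embed the query, look it up, fall back to the AI service.
public SysResponse GetResponse(string query, Func<string, string> aiService)
{
    // Step 2: turn the query into a semantic vector.
    float[] embedding = _embeddingService.GenerateEmbedding(query);

    // Steps 3-4: return a cached answer if one is similar enough.
    SysResponse cached = FindBestMatch(embedding, _similarityThreshold);
    if (cached != null)
        return cached; // cache HIT: no AI call made

    // Step 5: cache MISS - call the AI service, store the result, return it.
    string answer = aiService(query);
    StoreEntry(query, embedding, answer);
    return new SysResponse { Text = answer };
}
```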

Configuration Parameters

<setting name="AICache.ModelPath" value="/App_Data/models/model.onnx" />
<setting name="SemanticCache.SimilarityThreshold" value="0.80" />
<setting name="SemanticCache.ExpirationHours" value="4" />
<setting name="SemanticCache.MaxEntries" value="1000" />
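These settings can be read at startup with Sitecore's standard settings API. A minimal sketch, using the defaults from the snippet above as fallback values:

```csharp
using System.Globalization;
using Sitecore.Configuration;

// Read the cache parameters, falling back to the documented defaults.
string modelPath = Settings.GetSetting("AICache.ModelPath", "/App_Data/models/model.onnx");

double threshold = double.Parse(
    Settings.GetSetting("SemanticCache.SimilarityThreshold", "0.80"),
    CultureInfo.InvariantCulture); // invariant parse: "0.80" regardless of server locale

int expirationHours = int.Parse(Settings.GetSetting("SemanticCache.ExpirationHours", "4"));
int maxEntries = int.Parse(Settings.GetSetting("SemanticCache.MaxEntries", "1000"));
```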

Development Details

The solution was built in C# within the Sitecore platform. Key components:

OnnxEmbeddingService.cs

Responsible for embedding queries into vector space using ONNX.

float[] embedding = _embeddingService.GenerateEmbedding(query);
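Under the hood, the service loads the model into an `InferenceSession` from `Microsoft.ML.OnnxRuntime`. The sketch below assumes a sentence-transformer-style model exported with `input_ids` and `attention_mask` inputs; tokenization of the raw query string is model-specific and omitted here, and the pooling step that reduces token embeddings to one sentence vector is only indicated in a comment.

```csharp
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public class OnnxEmbeddingService
{
    private readonly InferenceSession _session;

    public OnnxEmbeddingService(string modelPath)
    {
        // Loads the exported model once; sessions are reusable and thread-safe.
        _session = new InferenceSession(modelPath);
    }

    // Takes already-tokenized ids; the real service tokenizes the query first.
    public float[] GenerateEmbedding(long[] inputIds)
    {
        var shape = new[] { 1, inputIds.Length };
        var ids = new DenseTensor<long>(inputIds, shape);
        var mask = new DenseTensor<long>(
            Enumerable.Repeat(1L, inputIds.Length).ToArray(), shape);

        var inputs = new[]
        {
            NamedOnnxValue.CreateFromTensor("input_ids", ids),
            NamedOnnxValue.CreateFromTensor("attention_mask", mask)
        };

        using var results = _session.Run(inputs);
        // First output holds the raw embeddings; mean pooling would
        // normally follow to produce a single sentence vector.
        return results.First().AsEnumerable<float>().ToArray();
    }
}
```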

SemanticCacheService.cs

The heart of the system: finds matches or stores new responses.

public SysResponse GetResponse(string query, Func<string, string> aiService)

It uses cosine similarity:

dot / (sqrt(magA) * sqrt(magB))
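That formula translates directly to C# (a sketch; the repository code may differ in details):

```csharp
// Cosine similarity between two embedding vectors: dot / (|a| * |b|).
// Returns a value in [-1, 1]; 1.0 means the vectors point the same way.
public static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Embedding vector sizes must match.");

    double dot = 0.0, magA = 0.0, magB = 0.0;
    for (int i = 0; i < a.Length; i++)
    {
        dot  += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}
```

A cache hit is then simply a cached vector whose similarity to the query vector meets or exceeds the configured threshold.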

Lookups are backed by an in-memory ConcurrentDictionary, plus the custom SitecoreAICache for Sitecore-managed storage.

SitecoreAICache.cs

Custom cache extending Sitecore.Caching.CustomCache

public void Add(string key, object value, TimeSpan expiration)
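A possible shape for that class is sketched below. CustomCache does not expire entries on its own, so the sketch stores the expiry timestamp alongside the value; this wrapper class and the exact method usage are assumptions, not verified against the repository, and CustomCache member signatures vary between Sitecore versions.

```csharp
using System;
using Sitecore.Caching;

public class SitecoreAICache : CustomCache
{
    // Value wrapper carrying its own expiry timestamp (assumed design).
    private class Entry
    {
        public object Value;
        public DateTime Expires;
    }

    public SitecoreAICache(long maxSizeBytes)
        : base("SitecoreAICache", maxSizeBytes)
    {
    }

    public void Add(string key, object value, TimeSpan expiration)
    {
        SetObject(key, new Entry { Value = value, Expires = DateTime.UtcNow + expiration });
    }

    public object Get(string key)
    {
        if (GetObject(key) is Entry entry && entry.Expires > DateTime.UtcNow)
            return entry.Value;

        InnerCache.Remove(key); // expired or missing: evict and report a miss
        return null;
    }
}
```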

How to Download and Set Up the ONNX Model

If the model files are not already present, add both files to your published solution under App_Data/models/.

Use Case: Matching Similar Questions

Query 1: “How do I reset my password?”
Query 2: “What’s the way to change my login credentials?”

Since embeddings are vector-based, the system recognizes that these queries are semantically similar and reuses the response when the similarity exceeds 0.80. In this test, the similarity was 0.844.

Result:

Cache HIT! ‘What’s the way to change my login credentials?’ matches ‘How do I reset my password?’ The same cached response is served for both queries.

Known Limitations

  • The similarity threshold is still heuristic: 0.80 works for most query pairs, but not all.
  • Keyword-based similarity is not yet fully used.
  • Embedding vector sizes must match; malformed embeddings trigger warnings.

Future Enhancements

  • Fine-tune similarity threshold per category of questions
  • Add keyword-based comparison
  • Use persistent storage or Redis for cache
  • Export analytics of common queries

Project Repository

Code: https://github.com/gabrielbaldeon/IASemanticCacheResponse

Conclusion

This working prototype shows real-world semantic caching in action, reducing unnecessary AI calls and boosting responsiveness. It’s early, but the concept is already proving valuable.

If you’re looking to bring AI into your enterprise stack efficiently, consider semantic caching as your first optimization layer.
