Raising the Bar for AI Assistant in Adobe Experience Platform

Winning approaches to evaluation and incident prevention.

In a previous blog post, we explored how AI Assistant in Adobe Experience Platform is monitored and improved via an end-to-end evaluation framework, including how we track, categorize, and learn from errors in real-world usage.

This sequel expands on those ideas to address the challenges that arise when AI Assistant encounters much larger, more varied user traffic, as detailed in our research paper, Evaluation and Incident Prevention in an Enterprise AI Assistant. We are thrilled to share that this latest work recently earned the prestigious DSRI AI Incidents and Best Practices Paper Award at the 39th Annual AAAI Conference on Artificial Intelligence. This recognition is a testament to our team’s commitment and engineering excellence in building enterprise-grade AI solutions that stand up to real-world complexity.

Some authors of the paper displaying the DSRI AI Incidents and Best Practices Paper Award won at the 39th Annual AAAI Conference on Artificial Intelligence.

Smarter annotation through coreset sampling

When we first launched AI Assistant, we attempted to label nearly every question to identify opportunities to improve. That became unsustainable as our customer base grew rapidly, with users asking thousands of questions each month. We needed a way to preserve a complete picture of system performance without having to triple or quadruple our annotation staff.

Our answer: coreset-based sampling.

In simple terms, each query and its corresponding answer is embedded into a high-dimensional vector space. From there, we pick a “coreset” (or minimal subset) of queries that collectively cover the major patterns and edge cases in the data. Coreset sampling makes one important assumption — that the error rate we are interested in is a linear function of a (learnable) embedding. With this assumption, finding a minimal representative sample becomes a weighted discrepancy minimization problem. This means that we must find the minimal set of data points whose embedding vectors best approximate the “average” embedding of the dataset.
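
Stated a bit more formally (our notation, not necessarily the paper's), with $\phi(x_i)$ the embedding of query $i$, $S$ the coreset, and $w_j \ge 0$ per-point weights, the problem is to choose a small subset whose weighted combination stays close to the dataset mean:

$$
\min_{S,\;w \ge 0}\ \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\phi(x_i) \;-\; \sum_{j \in S} w_j\,\phi(x_j) \Bigr\| \quad \text{subject to } |S| \le k.
$$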

We solve this problem by applying an algorithm called Greedy Iterative Geodesic Ascent (GIGA). GIGA essentially looks at how well each additional data point covers unexplored regions of the embedding space. By iteratively adding the “most unique” query each round, GIGA ensures we capture all major user behaviors and tricky corner cases, but without an explosion in labeling costs.
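
To make the idea concrete, here is a minimal sketch of a greedy selector in the spirit of GIGA. It only illustrates the objective of approximating the dataset's mean embedding with a small subset; the real algorithm additionally works with normalized (geodesic) directions and per-point weights, and the function names and sizes below are illustrative.

```python
import numpy as np

def greedy_coreset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k points whose running average best approximates the
    mean embedding of the full dataset (a simplified stand-in for GIGA)."""
    target = embeddings.mean(axis=0)
    selected: list[int] = []
    current_sum = np.zeros_like(target)
    available = np.ones(len(embeddings), dtype=bool)

    for step in range(1, k + 1):
        # Error of the coreset mean if each remaining point were added next.
        candidate_means = (current_sum + embeddings) / step
        errors = np.linalg.norm(candidate_means - target, axis=1)
        errors[~available] = np.inf
        best = int(np.argmin(errors))

        selected.append(best)
        available[best] = False
        current_sum += embeddings[best]

    return selected

# Example: pick 50 representative queries out of 5,000 embedded query/answer pairs.
rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(5000, 384))  # e.g., sentence-embedding vectors
coreset_indices = greedy_coreset(query_embeddings, k=50)
```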

Figure 3 from the paper: Mean squared error of uniform (random) sampling and covariate-aware sampling using GIGA. Error is measured with respect to proportion estimates obtained over the full set of annotations.

By comparing root mean squared error (RMSE) and other discrepancy metrics, we saw that coreset sampling outperforms random sampling. While a random sample might miss corner cases or over-represent redundant queries, coreset sampling selects fewer, more informative examples. This has freed us to devote expert annotator time to truly novel or complicated query/answer pairs.
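
As a rough illustration of how such a comparison can be run, the sketch below estimates the error rate from sampled annotations and compares it to the rate over the full annotated set. The sampling interface, trial count, and label format are our own assumptions, not the paper's exact setup.

```python
import numpy as np

def sampling_rmse(error_labels: np.ndarray, pick_sample, sample_size: int,
                  trials: int = 200, seed: int = 0) -> float:
    """RMSE between the error rate estimated from sampled annotations and the
    rate computed over the full annotated set, for any sampling strategy."""
    rng = np.random.default_rng(seed)
    true_rate = error_labels.mean()
    squared_errors = []
    for _ in range(trials):
        idx = pick_sample(rng, len(error_labels), sample_size)
        squared_errors.append((error_labels[idx].mean() - true_rate) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))

# Uniform baseline: sample annotation indices at random without replacement.
uniform = lambda rng, n, k: rng.choice(n, size=k, replace=False)
```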

Table 2 from the paper: Coreset size vs. uniform-sample (Unif) size, with the percentage reduction in the number of samples needed for uniform sampling to reach the same root mean squared error as the coreset-based approach.

As a result, even with thousands of questions flowing in every month, we can discover critical issues quickly without missing rare but high-impact problems. This keeps AI Assistant’s performance consistent and avoids forcing customers to uncover obscure failures on their own. At the same time, annotation becomes more scalable, freeing resources for developing new features rather than drowning in labeling tasks.

Hunting down failures with adversarial testing

Enterprise users benefit from fewer surprises and higher reliability in our system. By proactively uncovering worst-case scenarios ourselves, we protect customers from encountering these blind spots in day-to-day use.

Human annotation is crucial for broad monitoring, but there’s a second technique that has proven invaluable: adversarial testing. Instead of waiting for real users to stumble on subtle failures, we invite internal domain experts — people with specialized knowledge — to intentionally “break” AI Assistant. They craft questions designed to push every known weak spot: tricky domain jargon, contradictory instructions, or references to extremely specialized documentation.

Because experts pinpoint the root cause right away, we know whether a mishap arises from a missing data source, a misconfigured retrieval pipeline, or a language model hallucination. That insight is fed directly back into engineering sprints and data improvements, so each high-risk bug can be addressed at the source.
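
One lightweight way to operationalize this feedback loop is to keep each expert-filed case, together with its diagnosed root cause, in a replayable suite. The class names, root-cause buckets, and pass/fail check below are illustrative assumptions rather than the tooling described in the paper.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RootCause(Enum):
    MISSING_DATA_SOURCE = auto()
    RETRIEVAL_MISCONFIGURATION = auto()
    MODEL_HALLUCINATION = auto()

@dataclass
class AdversarialCase:
    question: str            # the expert-crafted "breaking" query
    must_contain: str        # a fact a correct answer has to include
    root_cause: RootCause    # the expert's diagnosis when the case was filed

def rerun_adversarial_suite(cases, assistant):
    """Replay every filed case against the current assistant and group the
    remaining failures by diagnosed root cause (simplistic containment check)."""
    failures: dict[RootCause, list[str]] = {}
    for case in cases:
        answer = assistant(case.question)  # `assistant` is any answer-producing callable
        if case.must_contain not in answer:
            failures.setdefault(case.root_cause, []).append(case.question)
    return failures
```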

Preventing regressions with shared evaluation datasets

We’re constantly rolling out new features and capabilities — sometimes every other week — to meet diverse customer needs. Multiple teams handle different components, and this level of rapid development can raise the risk of “feature regressions,” where one improvement inadvertently disrupts another function.

Our latest approach is shared evaluation datasets. At regular intervals (once a quarter), we gather and annotate a new batch of real production queries. This batch is then “locked in,” split into a development set (for component development or testing new ideas) and a holdout set that’s used strictly for final evaluations. If a team wants to introduce a change, they must show that the new code produces at least as good results on the holdout set as the current production baseline.
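
A minimal sketch of that protocol might look like the following; the split ratio, function names, and tolerance are illustrative assumptions, not the values we use in production.

```python
import random

def build_shared_eval_sets(annotated_queries: list, dev_fraction: float = 0.7,
                           seed: int = 42) -> tuple[list, list]:
    """Split a quarter's annotated production queries into a development set
    (for iterating on components) and a locked holdout set (for final gating)."""
    rng = random.Random(seed)
    shuffled = list(annotated_queries)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]  # (dev_set, holdout_set)

def passes_regression_gate(candidate_score: float, baseline_score: float,
                           tolerance: float = 0.0) -> bool:
    """A change ships only if its holdout score is at least as good as the
    current production baseline (optionally within a small tolerance)."""
    return candidate_score >= baseline_score - tolerance
```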

Figure 4 from the paper: Creation of shared evaluation datasets on an ongoing basis, using sampling and human annotation of production traffic, which is then partitioned into development and holdout datasets.

Closing thoughts

We maintain a holistic view of interactions by monitoring how quickly users recover from mistakes, whether they receive clear explanations, and how effectively they complete their intended tasks. This bigger-picture lens drives user interface refinements and better conversation flows, ensuring the AI Assistant remains straightforward and transparent rather than misleading users with confident but incorrect answers.

Figure 5 from the paper: The continual improvement framework, emphasizing human annotation as a way of both generating labeled data for shared evaluation datasets and driving measurement and error analysis. With the error severity framework, we can prioritize improvements to AI components, but also consider other improvements like UX changes that aid verifiability, explainability, and users' ability to recover.

Scaling up an AI Assistant is not just about spinning up bigger servers or using more advanced models. It requires strategic, precise methods for monitoring and continuous improvement. We anticipate evolving these processes further as we tackle emerging challenges, like domain-specific compliance requirements and more advanced multimodal questions. Stay tuned for future updates where we dive deeper into how these evaluation and testing strategies equip AI Assistant to remain at the cutting edge of enterprise reliability.

Authors: Akash V. Maharaj, David Arbour, Daniel Lee, Uttaran Bhattacharya, Anup Rao, Austin Zane, Avi Feller, Kun Qian, and Yunyao Li

Namita Krishnan, Rini Iyju, Guang-jie Ren, and Huong Vu also contributed to this article.