Show, Don’t Tell: Multimodal Answers in AI Assistant in Adobe Experience Platform


Imagine you have just built a new audience to promote your summer collection, and now you want to activate it to an advertising destination. You are unsure which settings to configure, whether the audience is eligible, or how to confirm if the activation went through correctly. Instead of jumping between scattered documentation, you ask AI Assistant in Adobe Experience Platform. It responds with a cohesive, step-by-step answer, including a screenshot highlighting the exact button to click and a 40‑second video that walks you through the full setup.

That’s the power of multimodal answers in AI Assistant in Adobe Experience Platform, and it can change how thousands of users learn, troubleshoot, and work in Adobe Experience Platform. In this blog, we introduce you to the engineering breakthrough that powers these experiences, as outlined in the paper MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering. This same technology powers the new Product Support Agent, designed to give practitioners access to a host of troubleshooting and learning resources that help them realize the value of Experience Platform more quickly.

The core research behind this feature began as part of a summer internship project at Adobe and is a great example of how interns contribute to high-impact, forward-looking work that directly shapes product innovation.

🎉 We are thrilled to share that this work earned the Best Demonstration Paper award at COLING 2025.

🚩 Where Enterprise Users Struggle

In enterprise environments, users often have to complete domain-specific workflows that involve multiple steps, specialized UI configurations, and product-specific terminology. These tasks are difficult to complete without integrated, contextual guidance, something that is hard to deliver through static documentation alone.

Existing solutions fall short because they either generate plain-text responses from multimodal content or append a single image or video to a text answer without integrating it meaningfully. They fail to address complex, goal-driven questions that require stitching together multiple content types (text, screenshots, video, and UI context) into a coherent, actionable response.

💡 Our Approach

Our approach addresses two practical challenges: first, locating the most relevant image, table, or video needed to substantiate the answer and help the user; second, weaving those assets into a single, readable narrative.


The architecture of the MuRAR framework.

To solve the retrieval problem, the text answer is divided into individual sentences. Each sentence is embedded and compared, using cosine similarity, with every text document snippet in the corpus. The search scope is then narrowed to the section where the best-matching source appears. All images, tables, and videos found in that section are represented with text features that combine nearby paragraph context with LLM-generated captions or summaries. Their embeddings are ranked by similarity to the answer sentence, and the top-scoring asset is selected. If the same asset surfaces for multiple sentences, only its highest-scoring instance is kept to avoid duplication.
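
To make that retrieval step concrete, here is a minimal Python sketch of the logic described above. The `embed` function, the `Asset`/`Section`/`Snippet` structures, and their field names are illustrative assumptions for this post, not the production implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Asset:
    url: str
    text_features: str  # nearby paragraph context + LLM-generated caption or summary


@dataclass
class Section:
    assets: list  # images, tables, and videos found in this documentation section


@dataclass
class Snippet:
    text: str
    section: Section


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve_assets(answer_sentences, snippets, embed):
    """Select at most one multimodal asset per answer sentence, deduplicated by best score."""
    best = {}  # asset URL -> (score, sentence index, asset)
    for i, sentence in enumerate(answer_sentences):
        s_vec = embed(sentence)
        # 1) Match the sentence to its most similar source snippet to narrow the search scope.
        source = max(snippets, key=lambda sn: cosine(s_vec, embed(sn.text)))
        # 2) Rank the assets in that section by similarity of their text features.
        scored = [(cosine(s_vec, embed(a.text_features)), a) for a in source.section.assets]
        if not scored:
            continue
        score, asset = max(scored, key=lambda pair: pair[0])
        # 3) If the same asset surfaces for several sentences, keep only its highest-scoring use.
        if asset.url not in best or score > best[asset.url][0]:
            best[asset.url] = (score, i, asset)
    return {i: asset for _, i, asset in best.values()}
```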

For answer refinement, placeholders for the multimodal content are inserted exactly where the context will help the reader. Each placeholder includes the URL of the multimodal data and its contextual text features, ensuring the LLM incorporates relevant information while minimizing the risk of generating irrelevant details and hallucinations. A subsequent prompt instructs the LLM to rewrite the draft around those placeholders, and a post‑processor swaps placeholders for markdown elements. The result is a cohesive response that reads like a mini‑tutorial rather than a dump of assets.
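
The refinement step can be sketched in the same spirit. The placeholder syntax, the prompt wording, and the `call_llm` helper below are assumptions made for illustration; the production prompt and placeholder format are only described at a high level above.

```python
import re

# Hypothetical placeholder format carrying the asset URL and its contextual text features.
PLACEHOLDER = '<asset url="{url}" context="{context}">'


def insert_placeholders(answer_sentences, assets_by_sentence):
    """Place each selected asset's placeholder right after the sentence it supports."""
    parts = []
    for i, sentence in enumerate(answer_sentences):
        parts.append(sentence)
        asset = assets_by_sentence.get(i)
        if asset is not None:
            parts.append(PLACEHOLDER.format(url=asset.url, context=asset.text_features))
    return " ".join(parts)


def refine_answer(draft_with_placeholders, call_llm):
    """Ask the LLM to rewrite the draft around the placeholders, then emit markdown."""
    prompt = (
        "Rewrite the answer so it reads naturally around the <asset> placeholders. "
        "Keep every placeholder and do not add, remove, or reorder them.\n\n"
        + draft_with_placeholders
    )
    refined = call_llm(prompt)
    # Post-process: swap each placeholder for a markdown image/video embed.
    return re.sub(r'<asset url="([^"]+)"[^>]*>',
                  r'![related screenshot or video](\1)', refined)
```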

From the user’s perspective, the AI response arrives as a single, flowing block: a concise explanation interspersed with screenshots, or a short clip illustrating the steps just described. Users can interact with the multimodal elements and view them in detail in a pop-up window.


The interface of AI Assistant in Adobe Experience Platform, showing a multimodal answer constructed by combining multimodal data retrieval and answer refinement.

Meet the human behind the AI

Rather than taking our word for how this works, let’s hear straight from the source. Sai Sree Harsha, the machine‑learning engineer who led MuRAR from research to production, sat down for a five‑minute chat about his work.

https://www.youtube.com/watch?v=QlXAScp8VuE

And here is the transcript of that chat:

Q: Tell us about yourself and what you work on at Adobe.

A: Hi everyone. I'm Sai Sree Harsha. I'm a machine learning engineer at Adobe.  I work on AI Assistant in Adobe Experience Platform, focusing mostly on the product knowledge capability. I have worked on various components of the pipeline, including document retrieval, prompt optimization, and the evaluation framework. More recently, I've been leading the development of MuRAR, a multimodal retrieval and answer refinement framework, which delivers rich image and video enhanced responses in AI Assistant.

Q: What customer pain points pushed you to build MuRAR?

A: The biggest pain point that we noticed was that our users would often struggle to connect AI Assistant responses with actual UI workflows.  They would read these long responses, but they would still not know where to click on the UI and what to expect. To address this, MuRAR actually embeds screenshots, videos, and architecture diagrams within the AI Assistant responses, turning these explanations into step-by-step mini tutorials.

Q: Tell us how this works.

A: Sure! The MuRAR framework works in two main stages, multimodal retrieval and answer refinement. We start with the standard Retrieval Augmented Generation setup. When a user asks a question, we first retrieve relevant text bits from our document corpus. These are then passed to an LLM, which generates a draft text answer just like a typical RAG pipeline.

Next comes the multimodal retrieval step. For each sentence in the draft answer, we trace back to the source that it likely came from. Then we look for relevant images or video clips in those same sections of the documents. Now finding the right visual isn't just about matching file names. We generate an embedding for each asset using a combination of nearby paragraph context, auto-generated image captions, and transcripts for videos.

We then match those against the text answer that we generated in the first step to find the most relevant multimodal assets. Once we've picked the best matches, we insert placeholders into the draft response where each image or video would be most helpful. We pass the draft response and the retrieved multimodal assets through the LLM again which rewrites the answer around these visuals to result in a cohesive response. The final result is an answer that feels both natural and helpful with screenshots or clips appearing exactly where the user needs them.

Q: How did that architecture translate into real user value?

A: Many internal users immediately found that the responses were more actionable and intuitive. Instead of needing to leave the chat or open multiple tabs, they got everything that they needed right there. It reduces confusion, speeds up onboarding and troubleshooting, and also improves user engagement.

Q: What’s coming next?

A: There are some exciting things that are coming next for MuRAR.  Right now, the system prioritizes precision. That is, we only surface images and videos when we are highly confident in their relevance. So as a result, not every response includes multimodal assets. We are now focused on improving recall, so more answers are enhanced with helpful visuals.

We are also adding deep link support for video timestamps, allowing users to jump directly to the exact moment in a video that addresses their questions.

Q: What is the best part of your role?

A: I really enjoy the balance between research, exploration, and real-world impact.  I get to prototype new ideas, build and deploy them, and see them directly improve user experience. Working with a team that's just as excited about applied AI makes it even better.

Q: Do you have any advice for engineers who want to work on Gen AI at Adobe?

A: For engineers who want to work on Gen AI, I would say the most important thing is to develop a strong understanding of the user problem that you're solving. Gen AI is a powerful tool, but it's only effective when applied in the right context with the right constraints. At Adobe, there's a strong emphasis on responsible AI and rigorous evaluation and quality improvement. So, if you enjoy solving hard problems which have real end-user impact, I think this is a great place to be!

Learn more and try it today

To learn more about our work and the impact we’re seeing, read the full paper here and follow Adobe Engineering on LinkedIn for updates on our latest innovations.

Start using AI Assistant in Adobe Experience Platform today and supercharge the productivity of your marketing teams. AI Assistant is now available in Real-Time CDP, Journey Optimizer, and Customer Journey Analytics! For more details on getting access, visit the Access AI Assistant in Experience Platform page.

If building Gen AI at enterprise scale excites you, explore open roles on the Adobe careers site.

Authors: Zhengyuan Zhu, Daniel Lee, Hong Zhang, Sai Sree Harsha, Loic Feujio, Akash Maharaj, Yunyao Li

Namita Krishnan, Guang-jie Ren, Shreya Anantha Raman, and Huong Vu also contributed to this article.