MemRL outperforms RAG on complex agent benchmarks without finetuning
Jan 22, 2026
A new technique developed by researchers at Shanghai Jiao Tong University and other institutions enables large language model agents to learn new skills without the need for expensive fine-tuning.

The researchers propose MemRL, a framework that gives agents the ability to develop episodic memory, the capacity to retrieve past experiences to create solutions for unseen tasks. MemRL allows agents to use environmental feedback to refine their problem-solving strategies continuously.

MemRL is part of a broader push in the research community to develop continual learning capabilities for AI applications. In experiments on key industry benchmarks, the framework outperformed baselines such as RAG and other memory-organization techniques, particularly in complex environments that require exploration and experimentation. This suggests MemRL could become a critical component for building AI applications that must operate in dynamic real-world settings where requirements and tasks constantly shift.

The stability-plasticity dilemma

One of the central challenges in deploying agentic applications is adapting the underlying model to new knowledge and tasks after the initial training phase. Current approaches generally fall into two categories: parametric approaches, such as fine-tuning, and non-parametric approaches, such as RAG. But both come with significant trade-offs.

Fine-tuning, while effective for baking in new information, is computationally expensive and slow. More critically, it often leads to catastrophic forgetting, a phenomenon where newly acquired knowledge overwrites previously learned knowledge and degrades the model's general performance.

Conversely, non-parametric methods like RAG are fundamentally passive: they retrieve information based solely on semantic similarity, such as vector embeddings, without evaluating the actual utility of the information to the input query. This approach assumes that "similar implies useful," an assumption that often breaks down in complex reasoning tasks.

The researchers argue that human intelligence solves this problem by maintaining "the delicate balance between the stability of cognitive reasoning and the plasticity of episodic memory." In the human brain, stable reasoning (associated with the cortex) is decoupled from dynamic episodic memory. This allows humans to adapt to new tasks without "rewiring neural circuitry" (the rough equivalent of model fine-tuning).

Inside the MemRL framework

Inspired by humans' use of episodic memory and cognitive reasoning, MemRL is designed to let an agent continuously improve its performance after deployment without compromising the stability of its backbone LLM. Instead of changing the model's parameters, the framework shifts the adaptation mechanism to an external, self-evolving memory structure.

In this architecture, the LLM's parameters remain completely frozen. The model acts effectively as the "cortex," responsible for general reasoning, logic, and code generation, but it is not responsible for storing the specific successes and failures encountered after deployment. This structure ensures stable cognitive reasoning and prevents catastrophic forgetting.

To handle adaptation, MemRL maintains a dynamic episodic memory component. Instead of storing plain text documents and static embedding values, as is common in RAG, MemRL organizes memory into "intent-experience-utility" triplets. These contain the user's query (the intent), the specific solution trajectory or action taken (the experience), and a score, known as the Q-value, that represents how successful this specific experience was in the past (the utility).
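To make that structure concrete, here is a minimal sketch of what such a triplet could look like in Python. The field names (intent, experience, q_value, embedding) are illustrative assumptions, not a schema published by the MemRL authors:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTriplet:
    """One episodic memory entry in the intent-experience-utility shape
    described in the paper. Field names here are illustrative, not
    MemRL's actual schema."""
    intent: str           # the user's query or task description
    experience: str       # the solution trajectory or action the agent took
    q_value: float = 0.0  # utility: how successful this experience has been
    embedding: list[float] = field(default_factory=list)  # vector for semantic lookup
```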
"MemRL is designed to be a 'drop-in' replacement for the retrieval layer in existing technology stacks and is compatible with various vector databases," Muning Wen, a co-author of the paper and PhD candidate at Shanghai Jiao Tong University, told VentureBeat. "The existence and updating of 'Q-Value' is solely for better evaluation and management of dynamic data... and is independent of the storage format."This utility score is the key differentiator from classic RAG systems. At inference time, MemRL agents employ a "two-phase retrieval" mechanism. First, the system identifies memories that are semantically close to the query to ensure relevance. It then re-ranks these candidates based on their Q-value, effectively prioritizing proven strategies.The framework incorporates reinforcement learning directly into the memory retrieval process. When an agent attempts a solution and receives environmental feedback (i.e., success or failure) it updates the Q-value of the retrieved memory. This creates a closed feedback loop: over time, the agent learns to ignore distractor memories and prioritize high-value strategies without ever needing to retrain the underlying LLM.While adding a reinforcement learning step might sound like it adds significant latency, Wen noted that the computational overhead is minimal. "Our Q-value calculation is performed entirely on the CPU," he said.MemRL also possesses runtime continual learning capabilities. When the agent encounters a new scenario, the system uses the frozen LLM to summarize the new trajectory and adds it to the memory bank as a new triplet. This allows the agent to expand its knowledge base dynamically as it interacts with the world.It is worth noting that the automation of the value assignment comes with a risk: If the system mistakenly validates a bad interaction, the agent could learn the wrong lesson. Wen acknowledges this "poisoned memory" risk but notes that unlike black-box neural networks, MemRL remains transparent and auditable. "If a bad interaction is mistakenly classified as a positive example... it may spread more widely," Wen said. "However … we can easily fix it by removing the contaminated data from the memory bank or resetting their Q-values."MemRL in actionThe researchers evaluated MemRL against several baselines on four diverse industry benchmarks: BigCodeBench (code generation), ALFWorld (embodied navigation), Lifelong Agent Bench (OS and database interaction), and Humanity's Last Exam (complex multidisciplinary reasoning). The results showed that MemRL consistently outperformed baselines in both runtime learning (improving during the session) and transfer learning (generalizing to unseen tasks).The advantages of this value-aware retrieval mechanism were most pronounced in exploration-heavy environments like ALFWorld. In this benchmark, which requires agents to navigate and interact with a simulated household environment, MemRL achieved a relative improvement of approximately 56% over MemP, another agentic memory framework. The researchers found that the reinforcement learning component effectively encouraged the agent to explore and discover solutions for complex tasks that similarity-based retrieval methods often failed to solve.When the memory bank was frozen and tested on held-out sets to measure generalization, MemRL achieved the highest accuracy across benchmarks. For example, on the Lifelong Agent Bench, it improved significantly upon the standard RAG baseline on OS tasks. 
MemRL in action

The researchers evaluated MemRL against several baselines on four diverse industry benchmarks: BigCodeBench (code generation), ALFWorld (embodied navigation), Lifelong Agent Bench (OS and database interaction), and Humanity's Last Exam (complex multidisciplinary reasoning). The results showed that MemRL consistently outperformed baselines in both runtime learning (improving during the session) and transfer learning (generalizing to unseen tasks).

The advantages of this value-aware retrieval mechanism were most pronounced in exploration-heavy environments like ALFWorld. In this benchmark, which requires agents to navigate and interact with a simulated household environment, MemRL achieved a relative improvement of approximately 56% over MemP, another agentic memory framework. The researchers found that the reinforcement learning component effectively encouraged the agent to explore and discover solutions for complex tasks that similarity-based retrieval methods often failed to solve.

When the memory bank was frozen and tested on held-out sets to measure generalization, MemRL achieved the highest accuracy across benchmarks. For example, on the Lifelong Agent Bench, it improved significantly upon the standard RAG baseline on OS tasks. This indicates that the system does not merely memorize training data but effectively filters out low-value memories and retains high-utility experiences that generalize to new situations.

The broader picture for self-evolving agents

MemRL fits within a growing body of research on Memory-Based Markov Decision Processes (M-MDP), a formulation that frames memory retrieval as an active decision-making step rather than a passive search function. By treating retrieval as an action that can be optimized via reinforcement learning, frameworks like MemRL and similar approaches such as Memento are paving the way for more autonomous systems.

For enterprise AI, this shift is significant. It suggests a future where agents can be deployed with a general-purpose LLM and then rapidly adapt to specific company workflows, proprietary databases, and unique problem sets through interaction alone. The key shift is that these frameworks treat applications as dynamic environments that agents can learn from.

These emerging capabilities will allow organizations to maintain consistent, high-performance agents that evolve alongside their business needs, solving the problem of stale models without incurring the prohibitive costs of constant retraining.

This marks a transition in how we value data. "In a future where static data is about to be exhausted, the interaction experience generated by each intelligent agent during its lifespan will become the new fuel," Wen said.