LifelongAgentBench:
Evaluating LLM Agents as Lifelong Learners

1 South China University of Technology, 2 MBZUAI,
3 Chinese Academy of Sciences, 4 East China Normal University

LifelongAgentBench is a unified benchmark and evaluation framework designed to systematically assess the lifelong learning capabilities of LLM-based agents. While lifelong learning is essential for intelligent agents operating in dynamic environments, current LLM agents remain stateless and struggle to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems, lacking the means to evaluate continual learning. LifelongAgentBench addresses this gap by offering a skill-grounded task suite with interdependent challenges across three interactive environments—Database, Operating System, and Knowledge Graph. It features four key properties: task dependency, label verifiability, reproducibility, and modularity.

Dataset Design


Partial Skill Distribution Examples from the Dataset
LifelongAgentBench features three environments—Database, Operating System, and Knowledge Graph—each with diverse skill sets to assess LLM agents’ lifelong learning. The Database environment includes 22 SQL skills such as filtering, grouping, and nested queries. The Operating System environment covers 29 Bash skills like file operations, user management, and text processing. The Knowledge Graph environment focuses on SPARQL tasks involving relation extraction and multi-step reasoning.
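To make the skill-grounded structure concrete, the sketch below shows what a single task instance might look like. The field names (environment, instruction, skills, ground_truth, depends_on) are illustrative assumptions for exposition, not the benchmark's actual schema.

# Hypothetical task record illustrating the skill-grounded task suite.
# Field names are illustrative; they do not reflect the benchmark's actual schema.
task = {
    "environment": "database",  # one of: database, operating_system, knowledge_graph
    "instruction": "List the names of customers with more than five orders.",
    "skills": ["filtering", "grouping", "aggregation"],  # skill labels used to build dependent task sequences
    "ground_truth": "SELECT name FROM customers ...",    # verifiable label for automatic checking
    "depends_on": ["basic_select", "join"],              # earlier skills this task builds on
}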

Group Self-Consistency


Experiments show that conventional experience replay underperforms, as replayed experiences run into the model's context-length limit and often include irrelevant data.
To overcome this, a group self-consistency mechanism is introduced to enhance lifelong learning.
Group Self-Consistency is a lightweight and scalable strategy to improve lifelong learning in LLM agents. It addresses the challenges of memory and inference overhead caused by experience replay by partitioning retrieved experiences into smaller groups. Each group is processed independently, and final predictions are aggregated through self-consistency voting. This approach significantly reduces input length while preserving prediction accuracy, leading to more stable and efficient learning across diverse environments. Its simplicity and generalizability make it a promising solution for memory-constrained lifelong learning scenarios.
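A minimal sketch of the mechanism is given below, assuming a retrieval step has already returned a list of past experiences and that generate stands in for a single LLM call. The function name, signature, and default group size are illustrative assumptions, not the benchmark's actual implementation.

from collections import Counter
from typing import Callable, Sequence

def group_self_consistency(
    experiences: Sequence[str],      # retrieved past experiences (e.g., solved task trajectories)
    query: str,                      # the current task instruction
    generate: Callable[[str], str],  # stand-in for a single LLM call
    group_size: int = 4,
) -> str:
    # Partition the retrieved experiences into small groups so that each
    # prompt stays well within the model's context window.
    groups = [
        experiences[i : i + group_size]
        for i in range(0, len(experiences), group_size)
    ]
    # Query the model once per group, each time with a much shorter prompt.
    predictions = [
        generate("\n\n".join(group) + "\n\n" + query) for group in groups
    ]
    # Aggregate the per-group predictions by majority (self-consistency) voting.
    return Counter(predictions).most_common(1)[0][0]

Because each group fits comfortably in the context window, input length grows with the group size rather than with the total number of retrieved experiences, while the vote across groups preserves prediction accuracy.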


Comparison of accuracy (with average input tokens shown in parentheses) under different group self-consistency settings.

Related Work


Comparison between LifelongAgentBench and existing benchmarks. †: highlights label error issues in WebArena.
LifelongAgentBench is the first unified benchmark specifically designed to evaluate lifelong learning in LLM-based agents across realistic and diverse environments. It addresses critical limitations in prior benchmarks, such as lack of task dependency modeling, label verifiability, and reproducibility. The benchmark features three interactive environments—Database, Operating System, and Knowledge Graph—that assess agents’ abilities to acquire, transfer, and retain skills over long task sequences.

BibTeX

@misc{zheng2025lifelongagentbenchevaluatingllmagents,
  title={LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners},
  author={Junhao Zheng and Xidi Cai and Qiuke Li and Duzhen Zhang and ZhongZhi Li and Yingying Zhang and Le Song and Qianli Ma},
  year={2025},
  eprint={2505.11942},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.11942},
}