The Power of Duality Principle in Offline Average-Reward Reinforcement Learning

Abstract

Offline reinforcement learning (RL) is widely used to find an optimal policy from a pre-collected dataset, without further interaction with the environment. Recent RL theory has made significant progress in developing sample-efficient offline RL algorithms under various relaxed assumptions on data coverage, focusing on either infinite-horizon discounted or finite-horizon episodic Markov decision processes (MDPs). In this work, we revisit the linear programming (LP) framework and the induced duality principle for offline RL, specifically for infinite-horizon average-reward MDPs. By virtue of this LP formulation and the duality principle, our result achieves the $\tilde{O}(1/\sqrt{n})$ near-optimal rate under partial data coverage assumptions. Our key enabler is to relax the equality constraints and introduce suitable new inequality constraints in the dual formulation of the LP. We hope our insights shed new light on the use of LP formulations and the induced duality principle in offline RL.
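For context, a minimal sketch of the standard LP pair for a finite average-reward MDP is given below, in textbook form; the notation (occupancy measure $d$, gain $\rho$, bias $h$, transition kernel $P$, reward $r$) is assumed here for illustration and is not necessarily the paper's, and conventions differ on which of the two programs is called the primal and which the dual.

Occupancy-measure LP (flow equality constraints):
$$\max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_{a'} d(s',a') = \sum_{s,a} P(s' \mid s,a)\, d(s,a) \;\; \forall s', \qquad \sum_{s,a} d(s,a) = 1.$$

Gain-and-bias LP:
$$\min_{\rho,\, h} \; \rho \quad \text{s.t.} \quad \rho + h(s) \ge r(s,a) + \sum_{s'} P(s' \mid s,a)\, h(s') \;\; \forall (s,a).$$

The contribution described in the abstract modifies this picture by relaxing equality constraints and adding new inequality constraints in the dual program, which is what enables the partial-coverage guarantee; the exact relaxed constraints are given in the paper itself.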

Publication
International Conference on Machine Learning Workshop on Duality for Modern Machine Learning (ICML 2023 Workshop)