Department of Industrial Engineering & Decision Analytics [Joint IEDA/ISOM] seminar - Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
Supporting the below United Nations Sustainable Development Goals:支持以下聯合國可持續發展目標:支持以下联合国可持续发展目标:
Last-iterate convergence of learning dynamics in games has attracted significant recent attention. In two-player zero-sum games with bandit feedback, where only the loss of the selected action pair is observed, Fiegel et al. [2025] show a separation between average-iterate and last-iterate convergence in duality gap: while the optimal t^{-1/2} rate after t rounds is achievable for the former via standard no-regret algorithms, the latter cannot converge faster than t^{-1/3} in expectation or t^{-1/4} with high probability. However, in many practical settings (such as preference learning), the players observe not only their loss but also the opponent’s action. This raises a natural question: can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that t^{-1/2} last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles preventing standard analysis for multi-armed bandits (the single-player case) from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm indeed converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.
Mengxiao Zhang is an assistant professor at the Business Analytics Department at University of Iowa. He obtained a PhD degree in Computer Science at University of Southern California advised by Prof. Haipeng Luo. His research is about designing robust and adaptive machine learning algorithms with strong theoretical guarantees, with a focus on general sequential learning problems, including online learning, bandit problems, game theory and various operational research and revenue management applications. He has published papers in several machine learning conferences and learning theory conferences, including oral and spotlight presentations. He has interned with Microsoft Research and Amazon, and received a B.S. in the School of EECS from Peking University.