JACN 2024 Vol.12(1): 1-7
doi: 10.18178/jacn.2024.12.1.288
Dynamic Adjustment of Exploration Rate for PPO Algorithm to Relief Rapid Decline of Episode Rewards
Shingchern D. You 1, Chao-Wei Ku 1,2, and Chien-Hung Liu 1
1. Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei, Taiwan
2. Taiwan Semiconductor Manufacturing Co., Taiwan
Email: scyou@ntut.edu.tw (S.D.Y.); weichaoku@gmail.com (C.W.K.); cliu@ntut.edu.tw (C.H.L.)
*Corresponding author
Manuscript received May 20, 2023, revised July 27, 2023; accepted August 30, 2023; published February 2, 2024.
Abstract—Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm. We observe that, in some environments, the agent may repeatedly select actions in a fixed sequence, leading to a rapid decline in episode rewards. Afterward, the reward remains very low for a prolonged training period, reducing training efficiency. In this paper, we propose an approach that dynamically adjusts the coefficient of the entropy term in the objective function of the PPO algorithm to encourage the agent to explore. Our experimental results show that the proposed algorithm effectively alleviates this detrimental rapid decline of episode rewards.
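The core idea in the abstract, raising the entropy (exploration) weight in the PPO objective when episode rewards collapse, can be illustrated with a minimal sketch. This is not the authors' implementation; the trigger threshold, boosted coefficient, and decay schedule (`drop_ratio`, `boosted_coef`, `decay`) are assumed values for illustration only.

```python
# Minimal sketch (assumed, not the paper's method): dynamically raise the
# entropy coefficient in the PPO loss when episode rewards drop sharply,
# then let it decay back toward its base value.

class EntropyCoefScheduler:
    def __init__(self, base_coef=0.01, boosted_coef=0.1, drop_ratio=0.5, decay=0.995):
        self.base_coef = base_coef        # default entropy weight in the PPO loss
        self.boosted_coef = boosted_coef  # raised weight used after a reward collapse
        self.drop_ratio = drop_ratio      # fraction of best reward that triggers the boost
        self.decay = decay                # per-update decay back toward the base value
        self.coef = base_coef
        self.best_reward = None

    def update(self, mean_episode_reward):
        """Return the entropy coefficient to use for the next PPO update."""
        if self.best_reward is None:
            self.best_reward = mean_episode_reward
        self.best_reward = max(self.best_reward, mean_episode_reward)
        # Boost exploration when the reward falls well below the best seen so far.
        if self.best_reward > 0 and mean_episode_reward < self.drop_ratio * self.best_reward:
            self.coef = self.boosted_coef
        else:
            # Otherwise decay smoothly back toward the base coefficient.
            self.coef = self.base_coef + (self.coef - self.base_coef) * self.decay
        return self.coef

# Illustrative use inside a PPO training loop (names are hypothetical):
#   coef = scheduler.update(mean_episode_reward)
#   loss = policy_loss + value_coef * value_loss - coef * entropy
```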
Keywords—entropy, Proximal Policy Optimization (PPO), exploration rate, reinforcement learning
Cite: Shingchern D. You, Chao-Wei Ku, and Chien-Hung Liu, "Dynamic Adjustment of Exploration Rate for PPO Algorithm to Relief Rapid Decline of Episode Rewards," Journal of Advances in Computer Networks vol. 12, no. 1, pp. 1-7, 2024.
Copyright © 2024 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.