TY - GEN
T1 - Improving Proximal Policy Optimization Algorithm in Interactive Multi-Agent Systems
AU - Shang, Yi
AU - Chen, Yifei
AU - Cruz, Francisco
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Proximal Policy Optimization (PPO), an outstanding reinforcement learning (RL) algorithm, has proven its efficiency in solving a wide range of problems. Compared to other reinforcement learning algorithms, it offers greater stability and reliability. However, as an on-policy algorithm, it suffers from sample inefficiency and moderate training speed. In this paper, we utilize two methods, namely parameter sharing and trajectory sharing, to speed up the training process of the PPO algorithm. Moreover, we introduce a method that uses an adaptive blending concept to prevent unnecessary updates during the parameter-sharing process. We also introduce a probabilistic selection technique, along with a thresholding method, to balance exploitation and exploration when incorporating the trajectory-sharing method. Tests performed in a multi-agent environment show that both methods converge significantly faster than the traditional PPO training process.
AB - Proximal Policy Optimization (PPO), an outstanding reinforcement learning (RL) algorithm, has proven its efficiency in solving a wide range of problems. Compared to other reinforcement learning algorithms, it offers greater stability and reliability. However, as an on-policy algorithm, it suffers from sample inefficiency and moderate training speed. In this paper, we utilize two methods, namely parameter sharing and trajectory sharing, to speed up the training process of the PPO algorithm. Moreover, we introduce a method that uses an adaptive blending concept to prevent unnecessary updates during the parameter-sharing process. We also introduce a probabilistic selection technique, along with a thresholding method, to balance exploitation and exploration when incorporating the trajectory-sharing method. Tests performed in a multi-agent environment show that both methods converge significantly faster than the traditional PPO training process.
UR - https://www.scopus.com/pages/publications/85203828921
U2 - 10.1109/ICDL61372.2024.10644943
DO - 10.1109/ICDL61372.2024.10644943
M3 - Conference contribution
AN - SCOPUS:85203828921
T3 - 2024 IEEE International Conference on Development and Learning, ICDL 2024
BT - 2024 IEEE International Conference on Development and Learning, ICDL 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Development and Learning, ICDL 2024
Y2 - 20 May 2024 through 23 May 2024
ER -