Sunghoon Hong 2022.04.28

[ICLR 2022] Deep Reinforcement Learning Technologies for Effective Single General-Purpose Policy

Sunghoon Hong, who joined the Applied AI Research Lab at LG AI Research in March 2022, previously worked in the lab of KAIST Professor Kee-Eung Kim, who is renowned for his research on reinforcement learning (RL). Professor Kim's lab conducts a number of RL studies that combine fundamental theory with natural language, as well as research on probabilistic reasoning. Using RL and graph neural networks (GNNs) in tandem, Researcher Hong studied how to represent complex real-world data.

Speaking on the background of his interest in the research field, Hong commented, “My initial interest in machine learning was centered on reinforcement learning. I felt that the existing method of data representation was limited, so I studied graph neural networks to develop a way to represent data more freely.”

Hong first became interested in machine learning in 2016, when AlphaGo was released and caused a boom in the artificial intelligence industry. While in his third year of computer science at the time, Hong was contemplating his future career when he became interested in AlphaGo. After having realized that the technology used to implement AlphaGo was closely related to his major, he began to study related fields.

Although it is true that AlphaGo played a significant role in Hong's decision to focus on RL, particularly in the field of machine learning, he chose to study RL primarily because he believed it to be the technology most closely related to the future of strong artificial intelligence. He was intrigued by the fact that sequential decision making in RL differs from other machine learning techniques, but is similar to how humans learn.

Given his enthusiasm for RL, working in Professor Kee-Eung Kim's laboratory was a natural choice. Hong, who was searching for a graduate school laboratory in Korea where he could conduct in-depth AI research, determined that Professor Kim's lab, which had been researching RL for many years, was the most appropriate.

Hong's primary research in Professor Kim's lab was the integration of RL and graphs. Compared with well-refined research data such as state information, little research has been conducted on graph data, which is closer to reality, and research that combines such data with RL is even harder to find. This is why Hong uses RL to process complex data such as graphs. To improve data representation, Hong employs GNNs, neural networks that operate on a graph's nodes and edges. In a GNN, the connections and correlations between nodes are built into the structure of the neural network itself.

The research accepted at ICLR 2022, “Structure-Aware Transformer Policy for Inhomogeneous Multi-Task Reinforcement Learning,” is a continuation of this work. Going beyond merely utilizing GNNs, the study addresses what must be supplemented to solve the given problem.

This post introduces the research that Hong conducted in the lab of KAIST Professor Kee-Eung Kim, which was published at ICLR 2022.

Reinforcement learning is one way to solve sequential decision-making problems. The environment returns the next state and a reward in response to each action the agent performs, and the agent is trained to maximize the sum of rewards. Combined with deep learning, RL has delivered strong performance on a variety of problems, as in the widely known AlphaGo and AlphaStar. However, most existing methods train a separate policy for each individual problem, which leads to low data efficiency. This has motivated the single general-purpose policy, which learns and solves multiple problems at the same time. For example, consider a general-purpose policy that controls various types of robots (Figure 1).
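
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the `LineWorld` environment, its reward, and the two policies are hypothetical and do not come from the paper. The point is simply that the environment returns the next state and a reward for each action, and that the quantity to maximize is the sum of rewards over an episode.

```python
import random


class LineWorld:
    """Hypothetical toy environment: the agent is rewarded for moving right."""

    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the reward equals forward progress.
        self.position += action
        reward = float(action)
        done = self.position >= self.goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=20):
    """Roll out one episode and return the sum of rewards."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return 1


def random_policy(state):
    return random.choice([-1, 1])


random.seed(0)
print("always right:", run_episode(LineWorld(), always_right))
print("random:      ", run_episode(LineWorld(), random_policy))
```

Training an RL agent amounts to searching for a policy whose episodes yield a larger reward sum; here the hand-written `always_right` policy already achieves the maximum for this toy task.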

(Figure 1) (Left) Training the policy neural network that controls each of the three robots separately. (Right) Training a single general-purpose policy neural network that controls all three robots.

This paper studies a single general-purpose policy that trains human- or animal-shaped robots composed of multiple joints to move forward by controlling the torque of each joint (Figure 2).^[1]

(Figure 2) Introduction to the composition of the articulated robot and problems controlling various robots. You may consider the articulated robot as a graph wherein nodes composed of the body and joint pairs are connected. The goal is to move forward by applying an appropriate force to the joint when the current state representing sensor information is given to each body.

The problem is that the size and structure of the state and action representations vary across robots. Previous studies have approached this problem in two main ways.

The first represents a robot composed of multiple joints as a graph and uses a graph neural network to process the robot's state information.^[2,3] A graph neural network takes nodes and their connections as input. In the robot, each joint becomes a node. The state of each node, such as its physical location and moving speed, is represented as a vector. This data representation is called a "message." Updating each node's message with the sum of its own message and those of its adjacent nodes is called "message passing." The graph neural network repeats the message passing operation.

This operation handles any number of nodes easily because a single per-node neural network is applied to all nodes in parallel, and it reflects the robot's morphology because messages are passed along the graph connectivity. As message passing is repeated, however, messages are mixed among adjacent nodes. This leads to the over-smoothing problem, in which nodes end up with similar data representations, and limits performance.^[4]
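
The following minimal sketch illustrates message passing and the over-smoothing effect on a tiny graph. It rests on several assumptions that are not from the paper: the 4-node chain and its initial messages are made up, there are no learned weights, and the neighbourhood sum is divided by the neighbourhood size (a mean) so that the values stay bounded.

```python
def message_passing(messages, adjacency, rounds):
    """Repeatedly replace each node's message with the mean over itself and its neighbours."""
    for _ in range(rounds):
        updated = []
        for node, message in enumerate(messages):
            neighbourhood = [message] + [messages[n] for n in adjacency[node]]
            updated.append([
                sum(m[d] for m in neighbourhood) / len(neighbourhood)
                for d in range(len(message))
            ])
        messages = updated
    return messages


def spread(messages):
    """Largest per-dimension gap between any two node messages."""
    dims = range(len(messages[0]))
    return max(max(m[d] for m in messages) - min(m[d] for m in messages) for d in dims)


# Hypothetical 4-node chain, e.g. torso - thigh - shin - foot.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
messages = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

for rounds in (0, 1, 5, 20):
    print(rounds, "rounds -> spread", spread(message_passing(messages, adjacency, rounds)))
```

The printed spread shrinks toward zero as the number of rounds grows, which is the over-smoothing behaviour: after enough rounds the nodes can no longer be told apart by their messages.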

The second uses the transformer model, which is used effectively across deep learning as the backbone of hyperscale AI. It applies the self-attention operation under the assumption that all nodes are connected to one another, without considering the graph connectivity.^[5] Self-attention is a form of message passing that computes the similarity between a node and every other node and updates the node with the similarity-weighted sum. This method performs well thanks to the powerful transformer model but is limited because it ignores the graph connectivity, that is, the robot's morphology information.
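
As a rough illustration of self-attention as similarity-weighted message passing, the sketch below updates each node from all nodes at once. It is a simplification under stated assumptions: a real transformer applies learned query, key, and value projections and multiple heads, whereas here the raw embeddings play all three roles, and the example embeddings are invented.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(embeddings):
    """Update every node with a similarity-weighted sum over ALL nodes."""
    scale = math.sqrt(len(embeddings[0]))
    outputs, weights = [], []
    for query in embeddings:
        # Similarity between this node and every node (itself included).
        scores = [dot(query, key) / scale for key in embeddings]
        w = softmax(scores)
        weights.append(w)
        outputs.append([
            sum(wj * value[d] for wj, value in zip(w, embeddings))
            for d in range(len(query))
        ])
    return outputs, weights


# Hypothetical embeddings for three nodes; nodes 0 and 1 are similar.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Note that no adjacency information enters the computation: node 0 attends more to node 1 than to node 2 purely because their embeddings are more similar, regardless of how the nodes are connected in the robot.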

A common limitation of the previous methods is that the graph connectivity, that is, the robot's morphology information, is not fully utilized. We therefore added a structural embedding that represents the morphology information to the transformer model. The structural embedding consists of a positional embedding that represents the position of each node and a relational embedding that describes the relationship between nodes. Just as the position of a word in a text is represented by its order, the position of a node in a graph can be represented by its order in a traversal of a tree rooted at some node of the graph. Because a tree is uniquely specified only when several traversals, including in-order traversal, are combined, the positional embedding of a node is expressed as a combination of embeddings learned from its position in each tree traversal. The relational embedding is learned from features that describe the relationship between two nodes: the Laplacian matrix, the shortest distance, and personalized PageRank. The structural embedding learned this way is added to the self-attention operation to better reflect the morphology information.
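
To make the raw ingredients of the structural embedding concrete, the sketch below computes traversal positions and shortest distances for a small tree. Several things here are assumptions for illustration only: the five-node morphology and its node names are hypothetical, the tree is assumed to be already expressed in binary form so that in-order traversal is well defined, and only the shortest distance is shown among the relational features. In a trained model these integers would index learned embedding tables rather than be used directly.

```python
from collections import deque

# Hypothetical binary tree: node -> (left child, right child); None means no child.
tree = {
    "torso": ("left_thigh", "right_thigh"),
    "left_thigh": ("left_shin", None),
    "right_thigh": ("right_shin", None),
    "left_shin": (None, None),
    "right_shin": (None, None),
}


def traversal_orders(tree, root):
    """Return the pre-order, in-order, and post-order node sequences."""
    pre, inorder, post = [], [], []

    def visit(node):
        if node is None:
            return
        left, right = tree[node]
        pre.append(node)
        visit(left)
        inorder.append(node)
        visit(right)
        post.append(node)

    visit(root)
    return pre, inorder, post


def positional_indices(tree, root):
    """Map each node to its (pre-order, in-order, post-order) position."""
    orders = traversal_orders(tree, root)
    return {node: tuple(order.index(node) for order in orders) for node in tree}


def shortest_distances(tree, source):
    """Breadth-first hop counts over the undirected version of the tree."""
    neighbours = {node: set() for node in tree}
    for parent, children in tree.items():
        for child in children:
            if child is not None:
                neighbours[parent].add(child)
                neighbours[child].add(parent)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


positions = positional_indices(tree, "torso")
for node, index_triple in positions.items():
    print(node, index_triple)
print(shortest_distances(tree, "left_shin"))
```

Each node receives a distinct triple of traversal positions, so the combination identifies where the node sits in the tree, while the hop counts describe how far apart two nodes are in the morphology.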

We named this structural embedding-based single general-purpose policy SWAT (Structure-aWAre Transformer). As shown in Figure 3, the SWAT model processes the agent's current state in three steps, encoding, message passing, and decoding, to determine the action to perform. First, the encoding step converts the node features, composed of the sensor information each node receives, into node embeddings. Next, the message passing step adds the positional embedding obtained from the morphology information to the node embeddings, runs the self-attention operation, in which attention scores capture inter-node similarity, and updates the node embeddings. Thanks to the positional and relational embeddings, the resulting data representation reflects the morphology information. Finally, the decoding step takes the final node embeddings and outputs the action vector for each node to perform. The agent is trained with Soft Actor-Critic, a representative reinforcement learning algorithm.^[6]
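
The three steps can be sketched as a single forward pass. This is not the authors' implementation: the function name `swat_like_forward`, the embedding width, the random weights, the single attention head, and the choice to inject the relational information as an additive bias on the attention scores are all assumptions made to keep the example small.

```python
import math
import random

random.seed(0)
DIM = 4  # hypothetical embedding width


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(vector, matrix):
    """Multiply a row vector by a (len(vector) x cols) matrix."""
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(len(matrix[0]))]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def swat_like_forward(node_features, positional, relational_bias, w_enc, w_dec):
    # 1) Encoding: per-node sensor features -> node embeddings (shared weights).
    h = [linear(x, w_enc) for x in node_features]
    # 2) Message passing: add positional embeddings, then run self-attention
    #    whose scores are shifted by a relational bias (an assumption here).
    h = [[a + b for a, b in zip(hi, pi)] for hi, pi in zip(h, positional)]
    updated = []
    for i, query in enumerate(h):
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) + relational_bias[i][j]
            for j, key in enumerate(h)
        ]
        weights = softmax(scores)
        updated.append([sum(w * value[d] for w, value in zip(weights, h)) for d in range(DIM)])
    # 3) Decoding: one bounded torque value per node.
    return [math.tanh(linear(hi, w_dec)[0]) for hi in updated]


# Hypothetical 3-node robot with 2 sensor readings per node.
node_features = [[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]]
positional = random_matrix(3, DIM)        # stands in for learned traversal-based embeddings
hops = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # shortest distances on a 3-node chain
relational_bias = [[-0.5 * d for d in row] for row in hops]  # stands in for learned relational embeddings
w_enc, w_dec = random_matrix(2, DIM), random_matrix(DIM, 1)

actions = swat_like_forward(node_features, positional, relational_bias, w_enc, w_dec)
print(actions)
```

Because the encoder and decoder weights are shared across nodes and attention works for any number of nodes, the same function can be called on a robot with a different number of joints, which is what makes a single general-purpose policy possible in this setting.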

(Figure 3) The overview of the SWAT method proposed in this paper.

We built the experimental groups by combining robots of differing types, including various human and animal morphologies, and ran the experiments in a robot gait simulation environment (Figure 4). We compared the performance of the proposed SWAT model, the graph neural network-based SMP model, and the transformer-based AMORPHEUS model in two settings: multi-task learning and transfer learning. The former measures gait performance on the robots seen during training, and the latter measures gait performance on robots with new morphologies, unseen during training, when the policy continues to be trained on them.

(Figure 4) Robot types in various morphologies, such as humans and animals^[7]

The results show that the proposed method generally learns faster and reaches better final performance than the existing methods (Figure 5). In particular, the gap with the existing methods was much greater when different types of robots were mixed. By taking the agent's morphology information into account, the transformer model learned a more effective data representation, which in turn let it learn a better single general-purpose policy for complicated robot gaits.

(Figure 5) Performance comparison between the method proposed in this paper (SWAT) and the existing ones (SMP and AMORPHEUS). The title of each graph indicates the robot type. For instance, Hopper is the one-legged robot and Humanoid is the two-legged robot. "++" means that the result was tested on the respective robot group, and Walker-Humanoid++ means the union of the Walker group and the Humanoid group. The X-axis indicates the learning time, and the Y-axis indicates the gait performance (the larger the value, the farther the robot moves forward).

For a more detailed analysis, we observed the agents' actions when hopper robots, which jump forward on one leg, were trained together with humanoid robots, which walk on two legs (Figure 6). The transformer-based AMORPHEUS model, trained without the morphology information, learned the inefficient action of jumping forward like the hopper robot even though the robot has two legs. The SWAT model in this paper, on the other hand, succeeded in learning to move forward by alternating its two legs like a human.

(Figure 6) Gaits of the humanoid robot trained differently depending on the presence of the morphology information. First Row: Gaits of the hopper robot, Second Row: The AMORPHEUS case trained without the morphology information. The robot hops like a hopper robot. Third Row: The SWAT case trained with the morphology information. The robot knows how to run like a human.

In general, training a model on multiple problems is expected to improve its learning capability, because knowledge can be shared among similar problems. If the data representation is not learned properly, however, joint training may instead interfere with learning, as in the case above. We therefore concluded that studying an appropriate data representation is essential when training a single general-purpose policy to solve multiple problems.

This article introduced a study on a reinforcement learning method that enables a single general-purpose policy to solve multiple problems effectively. As single general-purpose policy technology advances, we believe we can create agents that perform various tasks well at the same time, going beyond today's agents that do only one job well. Imagine a cleaning robot that is also good at dishwashing and laundry. Admittedly, reinforcement learning based on a hyperscale general-purpose policy that solves all sequential decision-making problems (problems that do not end with a single decision but require a sequence of decisions) is still a story for tomorrow. Still, we will keep marching forward toward more ambitious goals.

▶Structure-Aware Transformer Policy for Inhomogeneous Multi-Task Reinforcement Learning(Link)

References:
[1] Hong et al., “Structure-Aware Transformer Policy for Inhomogeneous Multi-Task Reinforcement Learning”, ICLR, 2022
[2] Wang et al., “NerveNet: Learning Structured Policy with Graph Neural Networks”, ICLR, 2018
[3] Huang et al., “One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control”, ICML, 2020
[4] Li et al., “Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning”, AAAI, 2018
[5] Kurin et al., “My Body is a Cage: The Role of Morphology in Graph-based Incompatible Control”, ICLR, 2021
[6] Haarnoja et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, ICML, 2018
[7] https://wenlong.page/modular-rl/
