Action Space

Actions can generally be divided into planning actions and control actions. Some environments offer high-level actions such as goto(object), which are planning actions. Planning actions cannot operate in unannotated scenes. Control actions have significantly greater potential, both in operational capability and in generalization ability, so LEGENT adopts control actions. However, LEGENT does not currently use real robot controls, which would in principle require precise control of each joint's rotation and more, making them overly complex for researchers who do not work with physical robots. While keeping the actions at the control level, we have simplified control as much as possible.

The actions of an agent are listed below. Note that, along with the basic move-forward action (equivalent to pressing the forward key on a keyboard), LEGENT also supports moving forward by an arbitrary distance in a single action (teleport_forward). During continuous forward movement, a repeated sequence of "move forward, move forward, move forward" adds very little information per sample. This is not a major issue for small models. For large models, however, where the computation cost of training is high, using the move-forward-with-distance action can greatly increase the information carried by each sample. At deployment, the model infers the distance to move forward, and the next inference can wait until that distance has been covered, which avoids the large overhead of running inference on every frame. LEGENT, as a platform aimed at large models, is the first to employ this action design.
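To make the cost argument concrete, the following back-of-the-envelope sketch counts model inferences for the two styles of forward movement. The function names and the per-step distance are assumptions made for illustration only; they are not LEGENT parameters.

```python
import math


def inferences_per_frame_stepping(distance_m: float, step_m: float) -> int:
    """Inferences needed when each model call yields one move_forward step.

    step_m is a hypothetical distance covered per step, not a LEGENT constant.
    """
    if distance_m <= 0:
        return 0
    return math.ceil(distance_m / step_m)


def inferences_with_distance(distance_m: float) -> int:
    """Inferences needed when one model call yields a forward distance."""
    return 1 if distance_m > 0 else 0


if __name__ == "__main__":
    distance = 3.0  # meters to cover
    step = 0.25     # assumed meters per move_forward step
    print(inferences_per_frame_stepping(distance, step))  # 12
    print(inferences_with_distance(distance))             # 1
```

Under these assumed numbers, a single distance-based action replaces twelve per-step inferences; the actual ratio depends on the frame rate and movement speed of the environment.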

Action	Descriptions	Details
text	text sent to the player	string. If empty, nothing is sent.
move_forward	move forward in the next frame	bool. If True, move forward. Takes effect only when `use_teleport==False`.
teleport_forward	move forward by a given distance	float. The number of meters to travel forward. Takes effect only when `use_teleport==True`.
rotate_right	rotate camera horizontally	float. [-180, 180). Positive values rotate right; negative values rotate left.
rotate_down	rotate camera vertically	float. [-90, 90). Positive values rotate downwards; negative values rotate upwards.
grab	grab or put down an object	bool. If True and the agent is not holding an object, grab the object at the center of the image. If True and the agent is holding an object, put it on the surface at the center of the image.
api_calls	API calls to the environment	List[Callable]. The API return values are placed in the returned observations.
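As a reading aid for the table above, here is a minimal sketch of a container that holds these fields and checks the documented ranges. The class name `ActionSketch`, its methods, and all default values (including the default for `use_teleport`) are hypothetical; this is not LEGENT's own action class.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union


@dataclass
class ActionSketch:
    """Hypothetical container mirroring the action table; not LEGENT's class."""

    text: str = ""                 # empty string means nothing is sent
    move_forward: bool = False     # used when use_teleport is False
    teleport_forward: float = 0.0  # meters, used when use_teleport is True
    rotate_right: float = 0.0      # degrees in [-180, 180)
    rotate_down: float = 0.0       # degrees in [-90, 90)
    grab: bool = False
    api_calls: List[Callable] = field(default_factory=list)
    use_teleport: bool = True      # assumed default, chosen for illustration

    def validate(self) -> None:
        """Raise ValueError if a rotation is outside its documented range."""
        if not -180.0 <= self.rotate_right < 180.0:
            raise ValueError("rotate_right must be in [-180, 180)")
        if not -90.0 <= self.rotate_down < 90.0:
            raise ValueError("rotate_down must be in [-90, 90)")

    def effective_forward(self) -> Tuple[str, Union[bool, float]]:
        """Return the forward command that takes effect under use_teleport."""
        if self.use_teleport:
            return ("teleport_forward", self.teleport_forward)
        return ("move_forward", self.move_forward)
```

The `effective_forward` helper only illustrates that `use_teleport` selects which of the two forward fields is honored, as the Details column describes.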

The types of these actions vary, but they are all expressed as code for the model. For example:

speak("OK")
move_forward(2.4)
rotate_right(35)
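One way such code strings could be turned back into action fields is sketched below. This is an illustration under stated assumptions, not LEGENT's parser: the function `parse_action_code` is hypothetical, and the mapping from `speak` to the `text` field and from `move_forward(distance)` to `teleport_forward` is inferred from the example above rather than confirmed by the source.

```python
import ast


def parse_action_code(code: str) -> dict:
    """Parse lines such as speak("OK") into a dict of action fields."""
    action = {}
    for stmt in ast.parse(code).body:
        call = stmt.value if isinstance(stmt, ast.Expr) else None
        if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
            raise ValueError("each line must be a simple function call")
        if call.keywords:
            raise ValueError("keyword arguments are not supported")
        name = call.func.id
        # literal_eval accepts only literals, so model-written code is never executed.
        args = [ast.literal_eval(arg) for arg in call.args]
        if name == "speak" and len(args) == 1:
            action["text"] = str(args[0])  # assumed mapping to the text field
        elif name == "move_forward" and len(args) == 1:
            # Assumed mapping: a distance argument selects teleport_forward.
            action["teleport_forward"] = float(args[0])
        elif name in ("rotate_right", "rotate_down") and len(args) == 1:
            action[name] = float(args[0])
        elif name == "grab" and not args:
            action["grab"] = True
        else:
            raise ValueError(f"unsupported action: {name}")
    return action


if __name__ == "__main__":
    print(parse_action_code('speak("OK")\nmove_forward(2.4)\nrotate_right(35)'))
```

Parsing with `ast` instead of `eval` keeps the model output as data: only the call names listed here are accepted, and arguments must be plain literals.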

The APIs that the environment provides to Python are listed below.

API	Descriptions	Params	Returns
PathToUser	Obtain the key points of the path to walk towards the player. The agent can reach the player by walking from one key point to the next in straight lines, with no obstacles in between.	None	api_returns['corners'] is the list of key points.
PathToObject	Obtain the key points of the path to walk towards an object.	The index of the object in the scene config	api_returns['corners'] is the list of key points.
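Because consecutive key points are connected by obstacle-free straight lines, a returned path can be followed with alternating rotate and move-forward-by-distance actions. The sketch below shows that conversion under several assumptions that the source does not confirm: each key point is an (x, y, z) tuple with y pointing up, a yaw of 0 faces +z, and positive yaw turns right. The helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def wrap_degrees(angle: float) -> float:
    """Wrap an angle into [-180, 180), the range rotate_right accepts."""
    return (angle + 180.0) % 360.0 - 180.0


def corners_to_steps(
    position: Sequence[float],
    yaw_deg: float,
    corners: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Convert key points into (rotate_right, teleport_forward) pairs.

    Assumes (x, y, z) coordinates with y up, yaw 0 facing +z,
    and positive yaw turning right.
    """
    steps = []
    x, z = position[0], position[2]
    for cx, _cy, cz in corners:
        dx, dz = cx - x, cz - z
        distance = math.hypot(dx, dz)
        if distance == 0.0:
            continue  # already at this key point
        target_yaw = math.degrees(math.atan2(dx, dz))
        steps.append((wrap_degrees(target_yaw - yaw_deg), distance))
        x, z, yaw_deg = cx, cz, target_yaw
    return steps


if __name__ == "__main__":
    # Hypothetical key points standing in for api_returns['corners'].
    corners = [(0.0, 0.0, 2.0), (3.0, 0.0, 2.0)]
    for turn, dist in corners_to_steps((0.0, 0.0, 0.0), 0.0, corners):
        print(f"rotate_right({turn:.1f}); move forward {dist:.1f} m")
```

Each pair corresponds to one rotation followed by one forward movement, so a path with N key points needs at most N such steps.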