Sim-to-Real with rex (Robotic Environments with jaX)

¤

This notebook offers an introductory tutorial for rex (Robotic Environments with jaX), a JAX-based framework for building graph-based environments designed for sim2real robotics.

In this tutorial, we will walk through a simple sim-to-real example using rex, where we will: 1. Define a simple pendulum system as an interconnected set of nodes, where: - brax is used as a stand-in for the real-world system. - We simulate real-world asynchronous effects by introducing communication and computation delays using predefined delay distributions. - The node definitions used in this notebook are covered in detail in the node_definitions.ipynb notebook. 2. Apply open-loop control to the pendulum system to gather data. 3. Use the collected data to: - Fit Gaussian Mixture Models (GMM) to estimate the delays introduced in step (1). - Build an ODE simulation environment. - Use evolutionary strategies to identify hidden delays and parameters in the ODE environment that best match the collected data. 4. Train an agent to balance the pendulum in the ODE environment using PPO (Proximal Policy Optimization). 5. Zero-shot transfer the trained agent to the real-world environment.

A Colab runtime with GPU acceleration is required. If you're using a CPU-only runtime, you can switch using the menu "Runtime > Change runtime type".

# @title Install Necessary Libraries
# @markdown This cell installs the required libraries for the project.
# @markdown If you are running this notebook in Google Colab, most libraries should already be installed.

import multiprocessing
import os


os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count={}".format(max(multiprocessing.cpu_count(), 1))
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

try:
    import rex

    print("Rex already installed")
except ImportError:
    print(
        "Installing rex via `pip install rex-lib[examples]`. "
        "If you are running this in a Colab notebook, you can ignore this message."
    )
    !pip install rex-lib[examples]
    import rex

# @title Import Libraries & Check GPU Availability
# @markdown We import all necessary libraries here, including JAX, numpy, and others.
# @markdown Additionally, we check if a GPU is available and display the number of CPU cores.

import functools
import itertools

import equinox as eqx
import jax
import jax.numpy as jnp
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import supergraph
import tqdm
from distrax import Normal
from IPython.display import HTML


sns.set()

import rex.base as base
import rex.utils as rutils
from rex.base import TrainableDist
from rex.constants import Clock, RealTimeFactor
from rex.open_colors import ecolor, fcolor


# Check if we have a GPU
try:
    gpu = jax.devices("gpu")
    gpu = gpu[0] if len(gpu) > 0 else None
    print("GPU found!")
except RuntimeError:
    print("Warning: No GPU found, falling back to CPU. Speedups will be less pronounced.")
    print(
        "Hint: if you are using Google Colab, try to change the runtime to GPU: "
        "Runtime -> Change runtime type -> Hardware accelerator -> GPU."
    )
    gpu = None

# Check the number of available CPU cores
print(f"CPU cores available: {len(jax.devices('cpu'))}")
cpus = itertools.cycle(jax.devices("cpu"))

GPU found!
CPU cores available: 16

# @title Define Pendulum System as an Interconnection of Nodes
# @markdown We will use nodes defined in the pendulum example to simulate the system.
# @markdown Since we do not have access to a real-world pendulum, the Brax simulation will act as our "real-world" system.
# @markdown Data from Brax will help us identify the delays and parameters of a simple ODE model.
# @markdown In a separate notebook, we demonstrate how to define nodes.
# @markdown Optionally, you can test the system with zero delays by uncommenting the relevant code.
import rex.examples.pendulum as pdm


# `Color` and `order` arguments are merely for visualization purposes.
# Delay distributions are used to simulate the delays as if the nodes were real-world systems.
# For real-world systems, it is normally not necessary to specify the delay distributions.
sensor = pdm.Sensor(
    name="sensor",
    rate=50,
    color="pink",
    order=1,  # Sensor that reads the angle from the pendulum
    delay_dist=Normal(loc=0.0075, scale=0.003),
)  # Computation delay of the sensor
agent = pdm.Agent(
    name="agent",
    rate=50,
    color="teal",
    order=3,  # Agent that generates random actions
    delay_dist=Normal(loc=0.01, scale=0.003),
)  # Computation delay of the agent
actuator = pdm.Actuator(
    name="actuator",
    rate=50,
    color="orange",
    order=2,  # Actuator that applies the action to the pendulum
    delay_dist=Normal(loc=0.0075, scale=0.003),
)  # Computation delay of the actuator
# Computation delay of the world is the world's step size (i.e. 1/rate)
world = pdm.BraxWorld(name="world", rate=50, color="grape", order=0)  # Brax world that simulates the pendulum
nodes = dict(world=world, sensor=sensor, agent=agent, actuator=actuator)

# Connect nodes
# The window determine the buffer size, i.e., the number of previous messages that are stored and can be accessed
# in the .step() method of the node. The window should be at least 1, as the most recent message is always stored.
# Blocking connections are synchronous, i.e., the receiving node waits for the sending node to send a message.
# The window determines the number of messages that are stored and can be accessed in the .step() method of the node.
agent.connect(
    sensor,
    window=3,
    name="sensor",
    blocking=True,  # Use the last three sensor messages as input (sync communication)
    delay_dist=Normal(loc=0.002, scale=0.002),
)  # Communication delay of the sensor
actuator.connect(
    agent,
    window=1,
    name="agent",
    blocking=True,  # Agent receives the most recent action (sync communication)
    delay_dist=Normal(loc=0.002, scale=0.002),
)  # Communication delay of the agent

# Connections below would not be necessary in a real-world system,
# but are used to communicate the action to brax, and convert brax's state to a sensor message
# Delay distributions are used to simulate the delays in the real-world system
sensor_delay, actuator_delay = 0.01, 0.01
std_delay = 0.002
world.connect(
    actuator,
    window=1,
    name="actuator",
    skip=True,  # Sends the action to the brax world (skip=True to resolve circular dependency)
    delay_dist=Normal(loc=actuator_delay, scale=std_delay),
)  # Actuator delay between applying the action, and the action being effective in the world
sensor.connect(
    world,
    window=1,
    name="world",  # Communicate brax's state to the sensor node
    delay_dist=Normal(loc=sensor_delay, scale=std_delay),
)  # Sensor delay between reading the state, and the world's state corresponding to the sensor reading.

# If you want to test with zero delays, uncomment below.
# sensor_delay, actuator_delay = 0.0, 0.0
# std_delay = 0.0
# for n in [sensor, agent, actuator]:
#     n.set_delay(delay_dist=Deterministic(loc=0.0), delay=0.0)
#     for i in n.inputs.values():
#         i.set_delay(delay_dist=Deterministic(loc=0.0), delay=0.0)
# world.inputs["actuator"].set_delay(delay_dist=Deterministic(loc=0.0), delay=0.0)

# Visualize the system
node_infos = {name: n.info for name, n in nodes.items()}
fig, ax = plt.subplots(1, 1, figsize=(8, 3))
rutils.plot_system(node_infos, ax=ax, k=1)
ax.legend()
ax.set_title("Brax System");

2024-10-09 16:49:09.740762: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.6.77). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.

# @title Apply Open-Loop Control to the Pendulum System to Gather Data
# @markdown This section collects data such as delays, actions, and sensor readings,
# @markdown by applying open-loop control to the simulated pendulum.

# Build the graph
# Note that one of the nodes is designated as the supervisor (agent).
# To make a comparison with the standard Gym-like approach, the supervisor node is the agent, and the other nodes are the environment.
# This means that the graph will be executed in a step-by-step manner, where the agent's rate determines the rate of the environment.
from rex.asynchronous import AsyncGraph
from rex.constants import LogLevel
from rex.utils import set_log_level


graph = AsyncGraph(
    nodes=nodes,
    supervisor=nodes["agent"],
    # Settings for simulating at fast as possible speed according to specified delays
    clock=Clock.SIMULATED,
    real_time_factor=RealTimeFactor.FAST_AS_POSSIBLE,
    # Settings for simulating at real-time speed according to specified delays
    # clock=Clock.SIMULATED, real_time_factor=RealTimeFactor.REAL_TIME,
    # Settings for real-world deployment
    # clock=Clock.WALL_CLOCK, real_time_factor=RealTimeFactor.REAL_TIME,
)

# Specify what we want to record (params, state, output) for each node,
graph.set_record_settings(params=True, inputs=False, state=True, output=True)

# Get initial graph state (aggregate of all node states)
rng = jax.random.PRNGKey(2)
rng, rng_init = jax.random.split(rng)
# 'order' defines the order in which the nodes must be initialized (some node initialization procedures may depend on the result of others)
gs_init = graph.init(rng_init, order=("agent",))
gs_init_real = gs_init  # Used later for evaluating the trained model from the same initial state

# Ahead-of-time compilation of the step method of each node
# Place all nodes on the CPU, except the agent, which is placed on the GPU (if available)
[set_log_level(LogLevel.DEBUG, n) for n in nodes.values()]  # Silence the log output
devices_step = {k: next(cpus) if k != "agent" or gpu is None else gpu for k in nodes}
graph.warmup(gs_init, devices_step, jit_step=True, profile=True)  # Profile=True for profiling the step function

# Prepare open-loop action sequence
rng, rng_actions = jax.random.split(rng)
dt_action = 2.0
num_actions = 6
actions = jnp.array([-1.7, 1.7, -1, 1, 0.0, 0.1])[:, None]
actions = jnp.repeat(
    actions, int(jnp.ceil(dt_action * nodes["agent"].rate)), axis=0
)  # Repeat actions for the duration of the agent's rate
num_steps = actions.shape[0]

# Execution: Gym-like API with .reset() & .step() methods
# We use the graph state obtained with .init() and perform step-by-step execution with .reset() and .step().
gs, ss = graph.reset(gs_init)  # Reset the graph to the initial state (returns the gs and the step state of the agent)
for i in tqdm.tqdm(range(num_steps), desc="brax | gather data"):
    # Access the last sensor message of the input buffer
    # -1 is the most recent message, -2 the second most recent, etc. up until the window size
    sensor_msg = ss.inputs["sensor"][-1].data  # .data grabs the pytree message object
    action = actions[i]  # Get the action for the current time step
    output = ss.params.to_output(action)  # Convert the action to an output message
    # Step the graph (i.e., executes the next time step by sending the output message to the actuator node)
    gs, ss = graph.step(gs, ss, output)  # Step the graph with the agent's output
graph.stop()  # Stops all nodes that were running asynchronously in the background

# Get the episode data (params, delays, outputs, etc.)
record = graph.get_record()  # Gets the records of all nodes

# Filter out the world node, as it would not be available in a real-world system
rollout_real = record.nodes["world"].steps.state
nodes_real = {name: n for name, n in nodes.items() if name != "world"}
record = record.filter(nodes_real)

brax | gather data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [00:02<00:00, 204.31it/s]

# @title Visualize Actions and Sensor Readings
# @markdown The plots below display the actions and sensor readings as might be observed in a real-world system.

fig_data, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(record.nodes["agent"].steps.ts_end[:-1], record.nodes["agent"].steps.output.action, label="action")
axes[0].set_xlabel("Time [s]")
axes[0].set_ylabel("Torque [Nm]")
axes[0].legend()

axes[1].plot(record.nodes["sensor"].steps.ts_end, record.nodes["sensor"].steps.output.th, label="th")
axes[1].set_xlabel("Time [s]")
axes[1].set_ylabel("Angle [rad]")
axes[1].legend()

axes[2].plot(record.nodes["sensor"].steps.ts_end, record.nodes["sensor"].steps.output.thdot, label="thdot")
axes[2].set_xlabel("Time [s]")
axes[2].set_ylabel("Ang. Vel. [rad/s]")
axes[2].legend();

# @title Fit GMM to Communication and Computation Delays
# @markdown We will fit a Gaussian Mixture Model (GMM) to the delays observed in the communication between the sensor and agent,
# @markdown as well as the computation delay of the agent’s step method.
# @markdown Other delays, such as actuator delays, can be fitted similarly.

from rex.gmm_estimator import GMMEstimator


# Fit GMM to communication delay between sensor and agent
delay_comm = record.nodes["agent"].inputs["sensor"].messages.delay
gmm_comm = GMMEstimator(delay_comm, "communication_delay")
gmm_comm.fit(num_steps=100, num_components=2, step_size=0.05, seed=0)
dist_comm = gmm_comm.get_dist()

# Fit GMM to computation delay of the agent's step method
delay_comp = record.nodes["agent"].steps.delay
gmm_comp = GMMEstimator(delay_comp, "computation_delay")
gmm_comp.fit(num_steps=100, num_components=2, step_size=0.05, seed=0)
dist_comp = gmm_comp.get_dist()

communication_delay | Time taken: 1.38 seconds.
computation_delay | Time taken: 1.22 seconds.

# @title Visualize Fitted GMM for Delays
# @markdown This cell plots the GMM fitting process to the delays for both the sensor-agent communication and the agent's computation.

%matplotlib agg

# Plot GMMs
# with plt.ioff():
fig_gmm, axes = plt.subplots(1, 2, figsize=(8, 3))
gmm_comm.plot_hist(ax=axes[0], edgecolor=ecolor.communication, facecolor=fcolor.communication, plot_dist=True)
axes[0].set_title("Delay (sensor->agent)")
gmm_comp.plot_hist(ax=axes[1], edgecolor=ecolor.computation, facecolor=fcolor.computation, plot_dist=False)
axes[1].set_title("Delay (agent.step)")
for ax, dist in zip(axes, [dist_comm, dist_comp]):
    ax.xaxis.set_major_formatter(matplotlib.ticker.FormatStrFormatter("%.3f"))
    ax.tick_params(axis="both", which="major", labelsize=10)
    ax.tick_params(axis="both", which="minor", labelsize=8)
    ax.set_xlabel("delay (s)", fontsize=10)
    ax.set_ylabel("density", fontsize=10)
    ax.set_xlim([0, dist.quantile(0.99)])  # Limit the x-axis to the 99th percentile of the delay

# Animate training
ani = gmm_comp.animate_training(fig=fig_gmm, ax=axes[1], num_frames=50)
# If you are running into an AttributeError regarding "_val_or_rc", skip the HTML display and run the next cell.
# This seems to be a Python 3.10 + matplotlib 3.9.x issue.
# Resolve by downgrading matplotlib to 3.7.x. Run `!pip install matplotlib==3.7.5`.
HTML(ani.to_html5_video())

# @title Visualize Data Flow in the Real-World System
# @markdown The top plot shows how long each node takes to process data and forward it to the next node.
# @markdown The bottom plot provides a graph representation that will form the basis for the computational graph used for system identification.
# @markdown - Each vertex represents a step call of a node, and each edge represents message transmission between two nodes.
# @markdown - Edges between consecutive steps of the same node represent the transmission of the internal state of the node.
# @markdown - Nodes start processing after an initial phase-shift, which can be controlled in the node definition.

plt.close(fig_gmm)  # Close fig_gmm to prevent it from displaying in the next cell
%matplotlib inline

df = record.to_graph()
timing_mode = "arrival"  # "arrival" or "usage"
G = rutils.to_networkx_graph(df, nodes=nodes)
fig, axes = plt.subplots(2, 1, figsize=(12, 6))
rutils.plot_graph(
    G,
    max_x=0.5,
    ax=axes[0],
    message_arrow_timing_mode=timing_mode,
    edge_linewidth=1.4,
    arrowsize=10,
    show_labels=True,
    height=0.6,
    label_loc="center",
)
supergraph.plot_graph(G, max_x=0.5, ax=axes[1])
fig.suptitle("Real-world data flow from recording")
axes[-1].set_xlabel("Time [s]");

# @title Build an ODE Simulation Environment to Identify Hidden Delays and Parameters
# @markdown We use collected data to build and identify delays and parameters in a simple ODE model.
# @markdown This model incorporates the communication and computation delays identified for the agent.

# Prepare the recorded data that we are going to use for system identification
outputs = {name: n.steps.output[None] for name, n in record.nodes.items()}

# By reinitializing the nodes via the `from_info` method, we can reuse the exact same configuration (rate, delay_dist, etc.).
# We can overwrite (e.g., delay_dist) or specify extra parameters (e.g., outputs) as keyword arguments.
# The info data is stored in the record, but can also be obtained from the nodes themselves with node.info.
actuator = pdm.SimActuator.from_info(
    record.nodes["actuator"].info, outputs=outputs["actuator"]
)  # Actuator data to replay the actions
sensor = pdm.SimSensor.from_info(
    record.nodes["sensor"].info, outputs=outputs["sensor"]
)  # Sensor data to calculate reconstruction error
agent = pdm.Agent.from_info(record.nodes["agent"].info, delay_dist=dist_comp)
nodes_sim = dict(sensor=sensor, agent=agent, actuator=actuator)

# Connect nodes according to real-world system
[n.connect_from_info(record.nodes[name].info.inputs, nodes_sim) for name, n in nodes_sim.items()]

# Create the world node that is going to simulate the ODE system
world = pdm.OdeWorld.from_info(
    nodes["world"].info
)  # Initialize OdeWorld with the same parameters (rate, etc.) as the brax world

# Next, we connect the world node to the nodes that interface with hardware (actuator and sensor)
# We specify trainable delays to represent sensor and actuator delays that we want to identify in addition to the ode parameters
world.connect(
    actuator,
    window=1,
    name="actuator",
    skip=True,  # Sends the action to the ODE world (skip=True to resolve circular dependency)
    # Trainable delay to represent the actuator delay
    # delay, min, and max are seconds, interp in ["zoh", "linear"]
    delay_dist=TrainableDist.create(delay=0.0, min=0, max=0.3, interp="linear"),
)
sensor.connect(
    world,
    window=1,
    name="world",  # Communicate the ODE world's state to the sensor node
    # Trainable delay to represent the sensor delay
    # delay, min, and max are seconds, interp in ["zoh", "linear"]
    delay_dist=TrainableDist.create(delay=0.0, min=0, max=0.3, interp="linear"),
)
nodes_sim["world"] = world  # Add the world node to the nodes

# Visualize the system
node_infos = {name: n.info for name, n in nodes_sim.items()}
fig, ax = plt.subplots(1, 1, figsize=(8, 3))
rutils.plot_system(node_infos, ax=ax, k=1)
ax.legend()
ax.set_title("ODE System");

# @title Build Computational Graph for System Identification
# @markdown This graph includes vertices representing simulator (i.e. world) steps and edges representing sensor and actuator
# @markdown delays between the world and the sensor/actuator nodes.
# @markdown The min/max values from the trainable delay distributions are used to define these edges.

rng, rng_aug = jax.random.split(rng)
cg = rex.artificial.augment_graphs(df, nodes_sim, rng_aug)
timing_mode = "arrival"  # "arrival" or "usage"
G = rutils.to_networkx_graph(cg, nodes=nodes)
fig, axes = plt.subplots(2, 1, figsize=(12, 6))
rutils.plot_graph(
    G,
    max_x=0.5,
    ax=axes[0],
    message_arrow_timing_mode=timing_mode,
    edge_linewidth=1.4,
    arrowsize=10,
    show_labels=True,
    height=0.6,
    label_loc="center",
)
supergraph.plot_graph(G, max_x=0.5, ax=axes[1])
fig.suptitle("Computation graph (extended with simulator nodes)")
axes[-1].set_xlabel("Time [s]");

# @title Define Subset of Trainable Parameters (Delays and ODE Parameters)
# @markdown The following loop describes the training process for identifying hidden delays and system parameters:
# @markdown 1. Sample normalized parameters from a search distribution.
# @markdown 2. Denormalize based on parameter min/max values.
# @markdown 3. Extend trainable parameters with non-trainable ones.
# @markdown 4. Run simulation and collect reconstruction errors.
# @markdown 5. Update search distribution based on the error.
# @markdown 6. Repeat until convergence.

# Initialize a graph that can be compiled and parallelized for system identification
# Note, we could choose to skip running the agent node for computational efficiency,
# as we know it does not affect the world node in this case, as we are replaying the actions in the actuator node.
graph_sim = rex.graph.Graph(nodes_sim, nodes_sim["agent"], cg)

# Get initial graph state (aggregate of all node states)
rng, rng_init = jax.random.split(rng)
gs_init = graph_sim.init(rng_init, order=("agent",))
gs_init_sim = gs_init

# Define the set of trainable parameters and the initial values
# We only want to optimize for a subset of the parameters, e.g., the delays and the parameters of the ODE system.
# Hence, we take all parameters, set them them to None (i.e., not trainable),
# and then set the ones we want to optimize to trainable values.
base_params = gs_init.params.unfreeze().copy()  # Get base structure for params
init_params = jax.tree_util.tree_map(lambda x: None, base_params)  # Set all parameters to None (i.e. not trainable)
init_params["world"] = init_params["world"].replace(
    J=0.0001,  # Inertia of the pendulum (trainable)
    mass=0.05,  # Mass of the pendulum (trainable)
    length=0.03,  # Length of the pendulum (trainable)
    b=1.0e-05,  # Damping of the pendulum (trainable)
    K=0.02,  # Spring constant of the pendulum (trainable)
    R=5.0,  # DC-motor resistance of the pendulum (trainable)
    c=0.0007,
)  # Coulomb friction of the pendulum (trainable)
init_params["sensor"] = init_params["sensor"].replace(sensor_delay=0.15)  # Sensor delay (trainable)
init_params["actuator"] = init_params["actuator"].replace(actuator_delay=0.15)  # actuator delay (trainable)
init_params["agent"] = init_params["agent"].replace(
    init_method="parametrized",  # Set to "parametrized" avoid random state initialization
    parametrized=jnp.array([0.5 * jnp.pi, 0.0]),
)  # Initial state (trainable)

# Print the initial parameters
print("Initial parameters (None means not trainable, some are static):")
eqx.tree_pprint(init_params, short_arrays=False)

# It's also good practice to perform a search over normalized parameters, provided we are given a min and max for each parameter.
min_params, max_params = init_params.copy(), init_params.copy()  # Get base structure for min and max params
# Set the min and max for the ODE parameters
min_params["world"] = jax.tree_util.tree_map(lambda x: x * 0.25, min_params["world"])  # Set the min for the ODE parameters
max_params["world"] = jax.tree_util.tree_map(lambda x: x * 2.0, max_params["world"])  # Set the max for the ODE parameters
# Set the min and max for the delays
min_params["sensor"] = min_params["sensor"].replace(sensor_delay=0.0)  # Set the min for the sensor delay
max_params["sensor"] = max_params["sensor"].replace(sensor_delay=0.3)  # Set the max for the sensor delay
min_params["actuator"] = min_params["actuator"].replace(actuator_delay=0.0)  # Set the min for the actuator delay
max_params["actuator"] = max_params["actuator"].replace(actuator_delay=0.3)  # Set the max for the actuator delay
# Ensure agent's initial state has a non-zero range, as 0.5*0 = 0, and 1.5*0 = 0
# if max_params["agent"].parametrized is not None:  # todo: remove
min_params = eqx.tree_at(
    lambda _min: _min["agent"].parametrized, min_params, jnp.array([-jnp.pi, -0.2])
)  # Set the min for the initial state
max_params = eqx.tree_at(
    lambda _max: _max["agent"].parametrized, max_params, jnp.array([jnp.pi, 0.2])
)  # Set the max for the initial state

# Next, we define the transform that transforms the normalized candidate parameters to the full parameter structure
# First, we denormalize the parameters, then extend with the non-trainable parameters (e.g., max_speed of the ODE world)
denorm = base.Denormalize.init(min_params, max_params)  # Create a transform to denormalize a set of normalized parameters
extend = base.Extend.init(base_params, init_params)  # Create a transform to extend the trainable params with the non-trainable
denorm_extend = base.Chain.init(denorm, extend)

# Normalize the initial, min, and max parameters
norm_init_params = denorm.inv(init_params)  # Normalize the initial parameters
norm_min_params = denorm.inv(min_params)  # Normalize the min parameters
norm_max_params = denorm.inv(max_params)  # Normalize the max parameters

Growing supergraph: 100%|███████████████████████████████████████| 601/601 [00:00<00:00, 907.73it/s, 1/1 graphs, 2411/2411 matched (66.86% efficiency, 6 nodes (pre-filtered: 6 nodes))]

Initial parameters (None means not trainable, some are static):
{
  'actuator':
  ActuatorParams(actuator_delay=0.15),
  'agent':
  AgentParams(
    policy=None,
    num_act=4,
    num_obs=4,
    max_torque=None,
    init_method='parametrized',
    parametrized=Array([1.5707964, 0.       ], dtype=float32),
    max_th=None,
    max_thdot=None,
    gamma=None,
    tmax=None
  ),
  'sensor':
  SensorParams(sensor_delay=0.15),
  'world':
  OdeParams(
    max_speed=None,
    J=0.0001,
    mass=0.05,
    length=0.03,
    b=1e-05,
    K=0.02,
    R=5.0,
    c=0.0007,
    dt_substeps_min=0.01,
    dt=0.02
  )
}

# @title Define Loss Function for Identifying Delays and Parameters
# @markdown This function calculates the reconstruction error for a given set of normalized parameters.
# @markdown The error is used to guide optimization during the training process.


def get_loss(norm_params, transform, rng):
    # Transform normalized parameters to full parameter structure
    params = transform.apply(norm_params)  # := denorm_extend.apply(norm_params)

    # Initialize the graph state
    # By supplying the params, we override the params generated by every node's init_params method
    # This allows us to run the graph with the specified parameters
    gs_init = graph_sim.init(rng=rng, params=params, order=("agent",))

    # Rollout graph
    final_gs = graph_sim.rollout(gs_init, carry_only=True)

    # Get the reconstruction error
    loss_th = final_gs.state["sensor"].loss_th
    loss_thdot = final_gs.state["sensor"].loss_thdot
    loss = loss_th + loss_thdot
    return loss


# Get cost of initial guess
init_loss = get_loss(norm_init_params, denorm_extend, rng)  # Get the initial loss
print(f"Loss of initial guess: {init_loss}")  # Loss using the initial parameters

Loss of initial guess: 11958.76171875

# @title Initialize CMA-ES Solver for System Identification
# @markdown We will use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize parameters.
# @markdown The solver is initialized with the normalized parameter bounds (min and max).

import rex.evo as evo


# Initialize the solver
max_steps = 50  # Number of optimization steps
strategy_kwargs = dict(popsize=200, elite_ratio=0.1, sigma_init=0.4, mean_decay=0.0, n_devices=1)
solver = evo.EvoSolver.init(norm_min_params, norm_max_params, strategy="CMA_ES", strategy_kwargs=strategy_kwargs)
init_sol_state = solver.init_state(norm_init_params)  # Initialize the solver state

# Run the optimization
rng, rng_sol = jax.random.split(rng)
init_log_state = solver.init_logger(num_generations=max_steps)
with rutils.timer("evo | compile + optimize"):
    sol_state, log_state, losses = evo.evo(
        get_loss, solver, init_sol_state, denorm_extend, max_steps=max_steps, rng=rng_sol, verbose=True, logger=init_log_state
    )
norm_opt_params = solver.unflatten(sol_state.best_member)
opt_params = denorm_extend.apply(norm_opt_params)

# Print identified delays vs true delays
# Note that it's inherently not possible to distinguish between sensor and actuator delays, but we can estimate their sum.
# Hence, we compare the sum of the identified delays with the sum of the true delays.
# print(f"Sensor delay | true={sensor_delay:.3f}\u00B1{std_delay:.3f}, opt={opt_params['sensor'].sensor_delay:.3f}, init={init_params['sensor'].sensor_delay:.3f}")
# print(f"Actuator delay | true={actuator_delay:.3f}\u00B1{std_delay:.3f}, opt={opt_params['actuator'].actuator_delay:.3f}, init={init_params['actuator'].actuator_delay:.3f}")
print(
    f"Actuator+senor delay | "
    f"true={sensor_delay+actuator_delay:.3f}\u00b1{std_delay * 2:.3f}, "
    f"opt={opt_params['sensor'].sensor_delay+opt_params['actuator'].actuator_delay:.3f}, "
    f"init={init_params['sensor'].sensor_delay+init_params['actuator'].actuator_delay:.3f}"
)


def rollout(params, rng, carry_only: bool = True):
    # Initialize the graph state
    # By supplying the params, we override the params generated by every node's init_params method
    # This allows us to run the graph with the specified parameters
    gs_init = graph_sim.init(rng=rng, params=params, order=("agent",))

    # Rollout graph
    gs_rollout = graph_sim.rollout(gs_init, carry_only=carry_only)
    return gs_rollout


rng, rng_rollout = jax.random.split(rng)
init_rollout = rollout(extend.apply(init_params), rng_rollout, carry_only=False)
opt_rollout = rollout(opt_params, rng_rollout, carry_only=False)

ParameterReshaper: 11 parameters detected for optimization.
ParameterReshaper: 11 parameters detected for optimization.
step: 0 | min_loss: 1127.36083984375 | mean_loss: 107012.1796875 | max_loss: 568760.6875 | bestsofar_loss: 1127.36083984375 | total_samples: 200
step: 1 | min_loss: 1204.197509765625 | mean_loss: 78481.2890625 | max_loss: 522921.8125 | bestsofar_loss: 1127.36083984375 | total_samples: 400
step: 2 | min_loss: 492.77667236328125 | mean_loss: 36297.34375 | max_loss: 560355.5 | bestsofar_loss: 492.77667236328125 | total_samples: 600
step: 3 | min_loss: 798.9423217773438 | mean_loss: 24496.9765625 | max_loss: 370595.75 | bestsofar_loss: 492.77667236328125 | total_samples: 800
step: 4 | min_loss: 205.81195068359375 | mean_loss: 28551.529296875 | max_loss: 472859.34375 | bestsofar_loss: 205.81195068359375 | total_samples: 1000
step: 5 | min_loss: 52.2267951965332 | mean_loss: 23809.375 | max_loss: 484273.875 | bestsofar_loss: 52.2267951965332 | total_samples: 1200
step: 6 | min_loss: 230.4680633544922 | mean_loss: 12314.4443359375 | max_loss: 296713.3125 | bestsofar_loss: 52.2267951965332 | total_samples: 1400
step: 7 | min_loss: 79.33822631835938 | mean_loss: 4005.635498046875 | max_loss: 144555.671875 | bestsofar_loss: 52.2267951965332 | total_samples: 1600
step: 8 | min_loss: 94.1173095703125 | mean_loss: 7078.138671875 | max_loss: 260285.21875 | bestsofar_loss: 52.2267951965332 | total_samples: 1800
step: 9 | min_loss: 43.1370849609375 | mean_loss: 5270.595703125 | max_loss: 281373.15625 | bestsofar_loss: 43.1370849609375 | total_samples: 2000
step: 10 | min_loss: 52.78352737426758 | mean_loss: 2540.908935546875 | max_loss: 216970.453125 | bestsofar_loss: 43.1370849609375 | total_samples: 2200
step: 11 | min_loss: 21.664243698120117 | mean_loss: 1918.5511474609375 | max_loss: 115976.4609375 | bestsofar_loss: 21.664243698120117 | total_samples: 2400
step: 12 | min_loss: 23.087072372436523 | mean_loss: 485.77374267578125 | max_loss: 3315.34619140625 | bestsofar_loss: 21.664243698120117 | total_samples: 2600
step: 13 | min_loss: 17.36151123046875 | mean_loss: 770.6312866210938 | max_loss: 92419.09375 | bestsofar_loss: 17.36151123046875 | total_samples: 2800
step: 14 | min_loss: 15.771574020385742 | mean_loss: 217.87913513183594 | max_loss: 2219.96484375 | bestsofar_loss: 15.771574020385742 | total_samples: 3000
step: 15 | min_loss: 13.985638618469238 | mean_loss: 156.47499084472656 | max_loss: 930.321044921875 | bestsofar_loss: 13.985638618469238 | total_samples: 3200
step: 16 | min_loss: 14.507802963256836 | mean_loss: 91.28081512451172 | max_loss: 345.5765075683594 | bestsofar_loss: 13.985638618469238 | total_samples: 3400
step: 17 | min_loss: 10.358619689941406 | mean_loss: 63.55476379394531 | max_loss: 610.5410766601562 | bestsofar_loss: 10.358619689941406 | total_samples: 3600
step: 18 | min_loss: 11.19699764251709 | mean_loss: 43.670433044433594 | max_loss: 201.0476837158203 | bestsofar_loss: 10.358619689941406 | total_samples: 3800
step: 19 | min_loss: 10.942270278930664 | mean_loss: 35.828548431396484 | max_loss: 147.9638214111328 | bestsofar_loss: 10.358619689941406 | total_samples: 4000
step: 20 | min_loss: 10.034892082214355 | mean_loss: 26.11482810974121 | max_loss: 97.91928100585938 | bestsofar_loss: 10.034892082214355 | total_samples: 4200
step: 21 | min_loss: 9.910079956054688 | mean_loss: 24.807268142700195 | max_loss: 62.91225814819336 | bestsofar_loss: 9.910079956054688 | total_samples: 4400
step: 22 | min_loss: 9.878521919250488 | mean_loss: 19.883729934692383 | max_loss: 67.5021743774414 | bestsofar_loss: 9.878521919250488 | total_samples: 4600
step: 23 | min_loss: 9.988247871398926 | mean_loss: 15.602483749389648 | max_loss: 28.748577117919922 | bestsofar_loss: 9.878521919250488 | total_samples: 4800
step: 24 | min_loss: 9.414042472839355 | mean_loss: 13.810941696166992 | max_loss: 25.670448303222656 | bestsofar_loss: 9.414042472839355 | total_samples: 5000
step: 25 | min_loss: 9.660612106323242 | mean_loss: 13.065614700317383 | max_loss: 24.568443298339844 | bestsofar_loss: 9.414042472839355 | total_samples: 5200
step: 26 | min_loss: 9.325304985046387 | mean_loss: 12.260498046875 | max_loss: 19.5605411529541 | bestsofar_loss: 9.325304985046387 | total_samples: 5400
step: 27 | min_loss: 9.467074394226074 | mean_loss: 11.700087547302246 | max_loss: 17.791507720947266 | bestsofar_loss: 9.325304985046387 | total_samples: 5600
step: 28 | min_loss: 9.31025218963623 | mean_loss: 11.482476234436035 | max_loss: 22.758207321166992 | bestsofar_loss: 9.31025218963623 | total_samples: 5800
step: 29 | min_loss: 9.248895645141602 | mean_loss: 11.012178421020508 | max_loss: 27.078828811645508 | bestsofar_loss: 9.248895645141602 | total_samples: 6000
step: 30 | min_loss: 9.093639373779297 | mean_loss: 10.656970024108887 | max_loss: 16.620359420776367 | bestsofar_loss: 9.093639373779297 | total_samples: 6200
step: 31 | min_loss: 8.997458457946777 | mean_loss: 10.727160453796387 | max_loss: 21.239608764648438 | bestsofar_loss: 8.997458457946777 | total_samples: 6400
step: 32 | min_loss: 8.898832321166992 | mean_loss: 10.218587875366211 | max_loss: 15.635639190673828 | bestsofar_loss: 8.898832321166992 | total_samples: 6600
step: 33 | min_loss: 8.828442573547363 | mean_loss: 10.003084182739258 | max_loss: 15.199397087097168 | bestsofar_loss: 8.828442573547363 | total_samples: 6800
step: 34 | min_loss: 8.915704727172852 | mean_loss: 9.74703598022461 | max_loss: 12.05509090423584 | bestsofar_loss: 8.828442573547363 | total_samples: 7000
step: 35 | min_loss: 8.837331771850586 | mean_loss: 9.452198028564453 | max_loss: 11.589494705200195 | bestsofar_loss: 8.828442573547363 | total_samples: 7200
step: 36 | min_loss: 8.703187942504883 | mean_loss: 9.30600357055664 | max_loss: 10.700343132019043 | bestsofar_loss: 8.703187942504883 | total_samples: 7400
step: 37 | min_loss: 8.815437316894531 | mean_loss: 9.199464797973633 | max_loss: 10.256194114685059 | bestsofar_loss: 8.703187942504883 | total_samples: 7600
step: 38 | min_loss: 8.725391387939453 | mean_loss: 9.130782127380371 | max_loss: 9.658961296081543 | bestsofar_loss: 8.703187942504883 | total_samples: 7800
step: 39 | min_loss: 8.653090476989746 | mean_loss: 9.083394050598145 | max_loss: 9.625836372375488 | bestsofar_loss: 8.653090476989746 | total_samples: 8000
step: 40 | min_loss: 8.663204193115234 | mean_loss: 9.016453742980957 | max_loss: 9.460453987121582 | bestsofar_loss: 8.653090476989746 | total_samples: 8200
step: 41 | min_loss: 8.673643112182617 | mean_loss: 8.977757453918457 | max_loss: 9.457200050354004 | bestsofar_loss: 8.653090476989746 | total_samples: 8400
step: 42 | min_loss: 8.62848949432373 | mean_loss: 8.939888954162598 | max_loss: 9.32252311706543 | bestsofar_loss: 8.62848949432373 | total_samples: 8600
step: 43 | min_loss: 8.626632690429688 | mean_loss: 8.919873237609863 | max_loss: 9.297188758850098 | bestsofar_loss: 8.626632690429688 | total_samples: 8800
step: 44 | min_loss: 8.698042869567871 | mean_loss: 8.914870262145996 | max_loss: 9.22347640991211 | bestsofar_loss: 8.626632690429688 | total_samples: 9000
step: 45 | min_loss: 8.631864547729492 | mean_loss: 8.902450561523438 | max_loss: 9.235267639160156 | bestsofar_loss: 8.626632690429688 | total_samples: 9200
step: 46 | min_loss: 8.659423828125 | mean_loss: 8.88314437866211 | max_loss: 9.170170783996582 | bestsofar_loss: 8.626632690429688 | total_samples: 9400
step: 47 | min_loss: 8.633057594299316 | mean_loss: 8.862720489501953 | max_loss: 9.144399642944336 | bestsofar_loss: 8.626632690429688 | total_samples: 9600
step: 48 | min_loss: 8.608020782470703 | mean_loss: 8.86151123046875 | max_loss: 9.157724380493164 | bestsofar_loss: 8.608020782470703 | total_samples: 9800
step: 49 | min_loss: 8.585338592529297 | mean_loss: 8.846010208129883 | max_loss: 9.186552047729492 | bestsofar_loss: 8.585338592529297 | total_samples: 10000
[47821][MainThread               ][tracer              ][evo | compile + optimize] Elapsed: 12.5141 sec
Actuator+senor delay | true=0.020±0.004, opt=0.029, init=0.300

# @title Plot Optimization Loss
# @markdown This plot shows the loss during the parameter optimization process.
# @markdown Lower losses indicate better fit between the model and collected data.

fig_loss, ax_loss = plt.subplots(1, 1, figsize=(4, 3))
log_state.plot("Loss", fig=fig_loss, ax=ax_loss)
ax_loss.set_yscale("log");

# @title Visualize Reconstructed and True Sensor Readings
# @markdown The following plots show the comparison between the true sensor readings and the reconstructed readings.
# @markdown A close match with both the observe and optimized lines on top of each other suggests the model is accurately capturing the system behavior.

# Rollout the optimized parameters
init_sensor = init_rollout.inputs["agent"]["sensor"].data[:, -1]
init_ts_sensor = init_rollout.inputs["agent"]["sensor"].ts_sent[:, -1]
init_actuator = init_rollout.inputs["world"]["actuator"].data[:, -1]
init_ts_actuator = init_rollout.inputs["world"]["actuator"].ts_sent[:, -1]

opt_sensor = opt_rollout.inputs["agent"]["sensor"].data[:, -1]
opt_ts_sensor = opt_rollout.inputs["agent"]["sensor"].ts_sent[:, -1]
opt_actuator = opt_rollout.inputs["world"]["actuator"].data[:, -1]
opt_ts_actuator = opt_rollout.inputs["world"]["actuator"].ts_sent[:, -1]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(record.nodes["agent"].steps.ts_end[:-1], record.nodes["agent"].steps.output.action, label="action")
axes[0].plot(opt_ts_actuator, opt_actuator.action[:, 0], label="action (ode, opt)")
axes[0].plot(init_ts_actuator, init_actuator.action[:, 0], label="action (ode, init)")
axes[0].set_xlabel("Time [s]")
axes[0].set_ylabel("Torque [Nm]")
axes[0].legend()

axes[1].plot(record.nodes["sensor"].steps.ts_end, record.nodes["sensor"].steps.output.th, label="th (brax)")
axes[1].plot(opt_ts_sensor, opt_sensor.th, label="th (ode, opt)")
axes[1].plot(init_ts_sensor, init_sensor.th, label="th (ode, init)")
axes[1].set_xlabel("Time [s]")
axes[1].set_ylabel("Angle [rad]")
axes[1].legend()

axes[2].plot(record.nodes["sensor"].steps.ts_end, record.nodes["sensor"].steps.output.thdot, label="thdot (brax)")
axes[2].plot(opt_ts_sensor, opt_sensor.thdot, label="thdot (ode, opt)")
axes[2].plot(init_ts_sensor, init_sensor.thdot, label="thdot (ode, init)")
axes[2].set_xlabel("Time [s]")
axes[2].set_ylabel("Ang. Vel. [rad/s]")
axes[2].legend();

# @title Train a Policy to Swing-Up the Pendulum Using PPO
# @markdown We will train a policy to swing up the pendulum using Proximal Policy Optimization (PPO) on the identified system.
# @markdown The success rate is the percentage of steps where the pendulum remains upright (cos(theta) > 0.95 and |theta_dot| < 0.5).
# @markdown We train 5 policies in parallel and select the best one based on the mean return

# Reinitialize a graph with nodes (that do not replay actions and calculate reconstruction error)
infos_sim = {name: n.info for name, n in nodes_sim.items()}
nodes_rl = {name: n.from_info(infos_sim[name]) for name, n in nodes_sim.items()}
[n.connect_from_info(infos_sim[name].inputs, nodes_rl) for name, n in nodes_rl.items()]
graph_rl = rex.graph.Graph(nodes_rl, nodes_rl["agent"], cg)

# Define the environment
env = pdm.rl.SwingUpEnv(graph=graph_rl)

# Set RL params
rl_params = opt_params.copy()  # Get base structure for params
rl_params["agent"] = rl_params["agent"].replace(init_method="random")
env.set_params(rl_params)

# Initialize PPO config
# sweep_pmv2r1zf is a PPO hyperparameter sweep that was found to work well for the pendulum swing-up task
config = pdm.rl.sweep_pmv2r1zf

# Train (success rate is the percentage of steps where the pendulum remains upright)
import rex.ppo as ppo


rng, rng_train = jax.random.split(rng)
rngs_train = jax.random.split(rng_train, num=5)  # Train 5 policies in parallel
train = functools.partial(ppo.train, env)
with rutils.timer("ppo | compile"):
    train_v = jax.vmap(train, in_axes=(None, 0))
    train_vjit = jax.jit(train_v)
    train_vjit = train_vjit.lower(config, rngs_train).compile()
with rutils.timer("ppo | train"):
    res = train_vjit(config, rngs_train)

# Get best policy (based on res.metrics["eval/mean_returns"])
best_idx = jnp.argmax(res.metrics["eval/mean_returns"][:, -1])
best_policy = res.policy[best_idx]
eval_params = rl_params.copy()
eval_params["agent"] = eval_params["agent"].replace(init_method="random", policy=best_policy)

Growing supergraph: 100%|██████████████████████████████████████| 601/601 [00:00<00:00, 1297.43it/s, 1/1 graphs, 2411/2411 matched (66.86% efficiency, 6 nodes (pre-filtered: 6 nodes))]

[47821][MainThread               ][tracer              ][ppo | compile       ] Elapsed: 32.4849 sec
train_steps=249856 | eval_eps=20 | return=-723.0+-131.5 | length=147+-0.0 | approxkl=0.0033 | success_rate=0.01
train_steps=249856 | eval_eps=20 | return=-1036.5+-83.7 | length=147+-0.0 | approxkl=0.0035 | success_rate=0.00
train_steps=249856 | eval_eps=20 | return=-466.0+-90.1 | length=147+-0.0 | approxkl=0.0034 | success_rate=0.04
train_steps=249856 | eval_eps=20 | return=-385.1+-68.9 | length=147+-0.0 | approxkl=0.0035 | success_rate=0.04
train_steps=249856 | eval_eps=20 | return=-369.4+-74.8 | length=147+-0.0 | approxkl=0.0034 | success_rate=0.03
train_steps=499712 | eval_eps=20 | return=-429.4+-90.5 | length=147+-0.0 | approxkl=0.0032 | success_rate=0.03
train_steps=499712 | eval_eps=20 | return=-1008.8+-84.3 | length=147+-0.0 | approxkl=0.0025 | success_rate=0.00
train_steps=499712 | eval_eps=20 | return=-416.1+-83.5 | length=147+-0.0 | approxkl=0.0032 | success_rate=0.04
train_steps=499712 | eval_eps=20 | return=-363.7+-61.2 | length=147+-0.0 | approxkl=0.0033 | success_rate=0.05
train_steps=499712 | eval_eps=20 | return=-369.1+-80.7 | length=147+-0.0 | approxkl=0.0037 | success_rate=0.08
train_steps=749568 | eval_eps=20 | return=-367.7+-92.7 | length=147+-0.0 | approxkl=0.0040 | success_rate=0.10
train_steps=749568 | eval_eps=20 | return=-1052.2+-64.1 | length=147+-0.0 | approxkl=0.0021 | success_rate=0.00
train_steps=749568 | eval_eps=20 | return=-353.8+-72.0 | length=147+-0.0 | approxkl=0.0032 | success_rate=0.06
train_steps=749568 | eval_eps=20 | return=-364.4+-77.3 | length=147+-0.0 | approxkl=0.0035 | success_rate=0.08
train_steps=749568 | eval_eps=20 | return=-338.8+-98.4 | length=147+-0.0 | approxkl=0.0042 | success_rate=0.08
train_steps=999424 | eval_eps=20 | return=-185.5+-108.5 | length=147+-0.0 | approxkl=0.0049 | success_rate=0.49
train_steps=999424 | eval_eps=20 | return=-1036.0+-89.5 | length=147+-0.0 | approxkl=0.0019 | success_rate=0.00
train_steps=999424 | eval_eps=20 | return=-367.8+-71.5 | length=147+-0.0 | approxkl=0.0035 | success_rate=0.07
train_steps=999424 | eval_eps=20 | return=-369.3+-92.4 | length=147+-0.0 | approxkl=0.0037 | success_rate=0.08
train_steps=999424 | eval_eps=20 | return=-322.1+-83.2 | length=147+-0.0 | approxkl=0.0043 | success_rate=0.11
train_steps=1249280 | eval_eps=20 | return=-235.5+-229.1 | length=147+-0.0 | approxkl=0.0059 | success_rate=0.48
train_steps=1249280 | eval_eps=20 | return=-1011.7+-101.8 | length=147+-0.0 | approxkl=0.0019 | success_rate=0.00
train_steps=1249280 | eval_eps=20 | return=-348.9+-86.0 | length=147+-0.0 | approxkl=0.0040 | success_rate=0.06
train_steps=1249280 | eval_eps=20 | return=-341.9+-60.0 | length=147+-0.0 | approxkl=0.0040 | success_rate=0.08
train_steps=1249280 | eval_eps=20 | return=-314.3+-76.8 | length=147+-0.0 | approxkl=0.0048 | success_rate=0.17
train_steps=1499136 | eval_eps=20 | return=-189.4+-267.2 | length=147+-0.0 | approxkl=0.0054 | success_rate=0.59
train_steps=1499136 | eval_eps=20 | return=-1047.6+-75.7 | length=147+-0.0 | approxkl=0.0019 | success_rate=0.00
train_steps=1499136 | eval_eps=20 | return=-370.2+-89.9 | length=147+-0.0 | approxkl=0.0039 | success_rate=0.08
train_steps=1499136 | eval_eps=20 | return=-353.9+-97.6 | length=147+-0.0 | approxkl=0.0044 | success_rate=0.10
train_steps=1499136 | eval_eps=20 | return=-325.4+-95.4 | length=147+-0.0 | approxkl=0.0054 | success_rate=0.10
train_steps=1748992 | eval_eps=20 | return=-218.9+-240.5 | length=147+-0.0 | approxkl=0.0059 | success_rate=0.56
train_steps=1748992 | eval_eps=20 | return=-1025.6+-91.8 | length=147+-0.0 | approxkl=0.0017 | success_rate=0.00
train_steps=1748992 | eval_eps=20 | return=-356.3+-76.1 | length=147+-0.0 | approxkl=0.0047 | success_rate=0.06
train_steps=1748992 | eval_eps=20 | return=-352.3+-82.7 | length=147+-0.0 | approxkl=0.0046 | success_rate=0.07
train_steps=1748992 | eval_eps=20 | return=-383.8+-85.8 | length=147+-0.0 | approxkl=0.0058 | success_rate=0.05
train_steps=1998848 | eval_eps=20 | return=-157.8+-234.5 | length=147+-0.0 | approxkl=0.0051 | success_rate=0.63
train_steps=1998848 | eval_eps=20 | return=-1063.5+-79.0 | length=147+-0.0 | approxkl=0.0019 | success_rate=0.00
train_steps=1998848 | eval_eps=20 | return=-342.3+-54.8 | length=147+-0.0 | approxkl=0.0051 | success_rate=0.11
train_steps=1998848 | eval_eps=20 | return=-354.5+-72.3 | length=147+-0.0 | approxkl=0.0052 | success_rate=0.09
train_steps=1998848 | eval_eps=20 | return=-365.6+-70.5 | length=147+-0.0 | approxkl=0.0061 | success_rate=0.07
train_steps=2248704 | eval_eps=20 | return=-197.4+-236.4 | length=147+-0.0 | approxkl=0.0046 | success_rate=0.56
train_steps=2248704 | eval_eps=20 | return=-997.9+-96.3 | length=147+-0.0 | approxkl=0.0018 | success_rate=0.01
train_steps=2248704 | eval_eps=20 | return=-319.3+-75.6 | length=147+-0.0 | approxkl=0.0053 | success_rate=0.09
train_steps=2248704 | eval_eps=20 | return=-350.2+-193.7 | length=147+-0.0 | approxkl=0.0054 | success_rate=0.10
train_steps=2248704 | eval_eps=20 | return=-377.5+-242.6 | length=147+-0.0 | approxkl=0.0065 | success_rate=0.20
train_steps=2498560 | eval_eps=20 | return=-187.4+-231.2 | length=147+-0.0 | approxkl=0.0047 | success_rate=0.59
train_steps=2498560 | eval_eps=20 | return=-1011.7+-79.2 | length=147+-0.0 | approxkl=0.0017 | success_rate=0.00
train_steps=2498560 | eval_eps=20 | return=-312.9+-78.5 | length=147+-0.0 | approxkl=0.0051 | success_rate=0.09
train_steps=2498560 | eval_eps=20 | return=-309.7+-96.8 | length=147+-0.0 | approxkl=0.0057 | success_rate=0.11
train_steps=2498560 | eval_eps=20 | return=-342.6+-111.9 | length=147+-0.0 | approxkl=0.0070 | success_rate=0.20
train_steps=2748416 | eval_eps=20 | return=-126.0+-68.6 | length=147+-0.0 | approxkl=0.0047 | success_rate=0.63
train_steps=2748416 | eval_eps=20 | return=-859.8+-186.3 | length=147+-0.0 | approxkl=0.0017 | success_rate=0.03
train_steps=2748416 | eval_eps=20 | return=-353.9+-86.6 | length=147+-0.0 | approxkl=0.0050 | success_rate=0.08
train_steps=2748416 | eval_eps=20 | return=-305.2+-84.7 | length=147+-0.0 | approxkl=0.0063 | success_rate=0.14
train_steps=2748416 | eval_eps=20 | return=-273.4+-106.8 | length=147+-0.0 | approxkl=0.0076 | success_rate=0.34
train_steps=2998272 | eval_eps=20 | return=-144.8+-86.9 | length=147+-0.0 | approxkl=0.0055 | success_rate=0.62
train_steps=2998272 | eval_eps=20 | return=-400.5+-92.5 | length=147+-0.0 | approxkl=0.0026 | success_rate=0.05
train_steps=2998272 | eval_eps=20 | return=-320.5+-91.8 | length=147+-0.0 | approxkl=0.0056 | success_rate=0.10
train_steps=2998272 | eval_eps=20 | return=-320.8+-99.0 | length=147+-0.0 | approxkl=0.0070 | success_rate=0.13
train_steps=2998272 | eval_eps=20 | return=-250.0+-109.5 | length=147+-0.0 | approxkl=0.0094 | success_rate=0.41
train_steps=3248128 | eval_eps=20 | return=-110.1+-64.6 | length=147+-0.0 | approxkl=0.0060 | success_rate=0.64
train_steps=3248128 | eval_eps=20 | return=-151.2+-111.2 | length=147+-0.0 | approxkl=0.0033 | success_rate=0.48
train_steps=3248128 | eval_eps=20 | return=-298.4+-107.5 | length=147+-0.0 | approxkl=0.0063 | success_rate=0.16
train_steps=3248128 | eval_eps=20 | return=-308.5+-89.5 | length=147+-0.0 | approxkl=0.0073 | success_rate=0.12
train_steps=3248128 | eval_eps=20 | return=-154.7+-78.6 | length=147+-0.0 | approxkl=0.0102 | success_rate=0.55
train_steps=3497984 | eval_eps=20 | return=-148.9+-93.6 | length=147+-0.0 | approxkl=0.0062 | success_rate=0.60
train_steps=3497984 | eval_eps=20 | return=-182.0+-101.1 | length=147+-0.0 | approxkl=0.0046 | success_rate=0.55
train_steps=3497984 | eval_eps=20 | return=-344.1+-81.6 | length=147+-0.0 | approxkl=0.0067 | success_rate=0.08
train_steps=3497984 | eval_eps=20 | return=-246.4+-94.6 | length=147+-0.0 | approxkl=0.0084 | success_rate=0.30
train_steps=3497984 | eval_eps=20 | return=-257.5+-269.4 | length=147+-0.0 | approxkl=0.0088 | success_rate=0.46
train_steps=3747840 | eval_eps=20 | return=-118.4+-82.1 | length=147+-0.0 | approxkl=0.0070 | success_rate=0.65
train_steps=3747840 | eval_eps=20 | return=-190.9+-117.0 | length=147+-0.0 | approxkl=0.0048 | success_rate=0.56
train_steps=3747840 | eval_eps=20 | return=-350.0+-195.4 | length=147+-0.0 | approxkl=0.0070 | success_rate=0.16
train_steps=3747840 | eval_eps=20 | return=-320.2+-88.5 | length=147+-0.0 | approxkl=0.0091 | success_rate=0.17
train_steps=3747840 | eval_eps=20 | return=-153.3+-102.3 | length=147+-0.0 | approxkl=0.0075 | success_rate=0.58
train_steps=3997696 | eval_eps=20 | return=-158.4+-109.6 | length=147+-0.0 | approxkl=0.0083 | success_rate=0.61
train_steps=3997696 | eval_eps=20 | return=-111.6+-77.6 | length=147+-0.0 | approxkl=0.0051 | success_rate=0.64
train_steps=3997696 | eval_eps=20 | return=-288.1+-67.9 | length=147+-0.0 | approxkl=0.0079 | success_rate=0.12
train_steps=3997696 | eval_eps=20 | return=-288.1+-214.5 | length=147+-0.0 | approxkl=0.0095 | success_rate=0.31
train_steps=3997696 | eval_eps=20 | return=-135.9+-92.0 | length=147+-0.0 | approxkl=0.0072 | success_rate=0.62
train_steps=4247552 | eval_eps=20 | return=-134.3+-82.8 | length=147+-0.0 | approxkl=0.0081 | success_rate=0.63
train_steps=4247552 | eval_eps=20 | return=-146.5+-108.1 | length=147+-0.0 | approxkl=0.0051 | success_rate=0.63
train_steps=4247552 | eval_eps=20 | return=-280.4+-72.4 | length=147+-0.0 | approxkl=0.0084 | success_rate=0.16
train_steps=4247552 | eval_eps=20 | return=-252.2+-219.3 | length=147+-0.0 | approxkl=0.0107 | success_rate=0.36
train_steps=4247552 | eval_eps=20 | return=-192.1+-245.3 | length=147+-0.0 | approxkl=0.0068 | success_rate=0.60
train_steps=4497408 | eval_eps=20 | return=-125.9+-81.8 | length=147+-0.0 | approxkl=0.0085 | success_rate=0.62
train_steps=4497408 | eval_eps=20 | return=-209.4+-127.2 | length=147+-0.0 | approxkl=0.0051 | success_rate=0.54
train_steps=4497408 | eval_eps=20 | return=-297.3+-90.2 | length=147+-0.0 | approxkl=0.0090 | success_rate=0.18
train_steps=4497408 | eval_eps=20 | return=-250.1+-214.0 | length=147+-0.0 | approxkl=0.0115 | success_rate=0.39
train_steps=4497408 | eval_eps=20 | return=-153.8+-98.9 | length=147+-0.0 | approxkl=0.0076 | success_rate=0.62
train_steps=4747264 | eval_eps=20 | return=-147.7+-88.6 | length=147+-0.0 | approxkl=0.0088 | success_rate=0.62
train_steps=4747264 | eval_eps=20 | return=-185.2+-114.9 | length=147+-0.0 | approxkl=0.0056 | success_rate=0.59
train_steps=4747264 | eval_eps=20 | return=-395.1+-259.8 | length=147+-0.0 | approxkl=0.0101 | success_rate=0.09
train_steps=4747264 | eval_eps=20 | return=-214.4+-81.5 | length=147+-0.0 | approxkl=0.0115 | success_rate=0.40
train_steps=4747264 | eval_eps=20 | return=-194.1+-123.8 | length=147+-0.0 | approxkl=0.0079 | success_rate=0.57
train_steps=4997120 | eval_eps=20 | return=-161.5+-102.1 | length=147+-0.0 | approxkl=0.0072 | success_rate=0.57
train_steps=4997120 | eval_eps=20 | return=-170.6+-66.3 | length=147+-0.0 | approxkl=0.0070 | success_rate=0.58
train_steps=4997120 | eval_eps=20 | return=-243.9+-118.5 | length=147+-0.0 | approxkl=0.0110 | success_rate=0.28
train_steps=4997120 | eval_eps=20 | return=-244.7+-219.0 | length=147+-0.0 | approxkl=0.0115 | success_rate=0.40
train_steps=4997120 | eval_eps=20 | return=-148.2+-111.6 | length=147+-0.0 | approxkl=0.0076 | success_rate=0.64
[47821][MainThread               ][tracer              ][ppo | train         ] Elapsed: 50.7621 sec

# @title Visualize PPO Training Progress
# @markdown The plots below show the training progress of the PPO algorithm in terms of returns, success rate, and policy KL divergence.

fig_ppo, axes_ppo = plt.subplots(1, 3, figsize=(12, 3))
total_steps = res.metrics["train/total_steps"].transpose()
mean, std = res.metrics["eval/mean_returns"].transpose(), res.metrics["eval/std_returns"].transpose()
axes_ppo[0].plot(total_steps, mean, label="mean")
axes_ppo[0].set_title("Returns")
axes_ppo[0].set_xlabel("Total steps")
axes_ppo[0].set_ylabel("Cum. return")
mean = res.metrics["eval/success_rate"].transpose()
axes_ppo[1].plot(total_steps, mean, label="mean")
axes_ppo[1].set_title(r"Success ($\cos(\theta) > 0.95$ & $|\dot{\theta}| < 0.5$)")
axes_ppo[1].set_xlabel("Total steps")
axes_ppo[1].set_ylabel("Upright [% of steps]")
mean, std = res.metrics["train/mean_approxkl"].transpose(), res.metrics["train/std_approxkl"].transpose()
axes_ppo[2].plot(total_steps, mean, label="mean")
axes_ppo[2].set_title("Policy KL")
axes_ppo[2].set_xlabel("Total steps")
axes_ppo[2].set_ylabel("Approx. kl");

# @title Evaluate the Learned Policy on the Simulated System (i.e. used during training)
# @markdown We evaluate the learned policy by running multiple rollouts in parallel.
# @markdown The success rate is calculated as the percentage of time the pendulum remains upright and still.

num_rollouts = 20_000  # Lower if memory is an issue
max_steps = int(5 * nodes_sim["agent"].rate)  # 5 seconds

# Check if we have a GPU
try:
    gpu = jax.devices("gpu")
except RuntimeError:
    num_rollouts = 100  # Lower if no GPU is available
    print(
        "Warning: No GPU found, falling back to CPU. Speedups will be less pronounced. Lowering the number of rollouts to 100."
    )
    print(
        "Hint: if you are using Google Colab, try to change the runtime to GPU: "
        "Runtime -> Change runtime type -> Hardware accelerator -> GPU."
    )


def rollout_fn(rng):
    # Initialize graph state
    _gs = graph_rl.init(rng, params=eval_params, order=("agent",))
    # Make sure to record the state
    _gs = graph_rl.init_record(_gs, params=False, state=True, output=False)
    # Run the graph for a fixed number of steps
    _gs_rollout = graph_rl.rollout(_gs, carry_only=True, max_steps=max_steps)
    # This returns a record that may only be partially filled.
    record = _gs_rollout.aux["record"]
    is_filled = record.nodes["world"].steps.seq >= 0  # Unfilled steps are marked with -1
    return is_filled, record.nodes["world"].steps.state


rng, rng_rollout = jax.random.split(rng)
rngs_rollout = jax.random.split(rng_rollout, num=num_rollouts)
t_jit = rutils.timer(
    f"Vectorized evaluation of {num_rollouts} rollouts | compile", log_level=100
)  # Makes them available outside the context manager
with t_jit:
    rollout_fn_jv = jax.jit(jax.vmap(rollout_fn))
    rollout_fn_jv = rollout_fn_jv.lower(rngs_rollout)
    rollout_fn_jv = rollout_fn_jv.compile()
t_run = rutils.timer(f"Vectorized evaluation of {num_rollouts} rollouts | rollouts", log_level=100)
with t_run:
    is_filled, final_states = rollout_fn_jv(rngs_rollout)
    final_states.th.block_until_ready()

# Only keep the filled rollouts (we did not run the full duration of the computation graph)
final_states = final_states[is_filled]

# Calculate success rate
thr_upright = 0.95  # Cosine of the angle threshold
thr_static = 0.5  # Angular velocity threshold
cos_th = jnp.cos(final_states.th)
thdot = final_states.thdot
is_upright = cos_th > thr_upright
is_static = jnp.abs(thdot) < thr_static
is_valid = jnp.logical_and(is_upright, is_static)
success_rate = is_valid.sum() / is_valid.size
print(f"sim. eval | Success rate: {success_rate:.2f}")
print(
    f"sim. eval | fps: {(num_rollouts * max_steps) / t_run.duration / 1e6:.0f} Million steps/s | compile: {t_jit.duration:.2f} s | run: {t_run.duration:.2f} s"
)

[47821][MainThread               ][tracer              ][Vectorized evaluation of 20000 rollouts | compile] Elapsed: 5.4447 sec
[47821][MainThread               ][tracer              ][Vectorized evaluation of 20000 rollouts | rollouts] Elapsed: 0.1476 sec
sim. eval | Success rate: 0.81
sim. eval | fps: 34 Million steps/s | compile: 5.44 s | run: 0.15 s

# @title Evaluate the Learned Policy on the "Real" Brax System (i.e. sim2real transfer)
# @markdown We will now evaluate the learned policy on the real Brax simulation system, which we used to collect data in the beginning.


@jax.jit
def get_action(step_state: base.StepState):
    obs = eval_params["agent"].get_observation(step_state)
    action = eval_params["agent"].policy.get_action(obs)
    output = eval_params["agent"].to_output(action)  # Convert the action to an output message
    new_ss = step_state.replace(state=eval_params["agent"].update_state(step_state, action))
    return new_ss, output


# Run for one episode
gs, ss = graph.reset(gs_init_real)  # Reset the graph to the initial state (returns the gs and the step state of the agent)
for i in tqdm.tqdm(range(max_steps), desc="brax | evaluate policy"):
    new_ss, output = get_action(ss)
    gs, ss = graph.step(gs, new_ss, output)  # Step the graph with the agent's output
graph.stop()  # Stops all nodes that were running asynchronously in the background
# Get the record
eval_record = graph.get_record()  # Get the record of the graph
eval_real_rollout = eval_record.nodes["world"].steps.state
# Filter out the world node, as it would not be available in a real-world system
eval_record = eval_record.filter(nodes_real)

brax | evaluate policy: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [00:01<00:00, 208.06it/s]

# @title Visualize Actions and Sensor Readings in the Real-World System
# @markdown The following plots display the actions and sensor readings during the evaluation of the policy on the real-world system.

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(eval_record.nodes["agent"].steps.ts_end[:-1], eval_record.nodes["agent"].steps.output.action, label="action")
axes[0].set_xlabel("Time [s]")
axes[0].set_ylabel("Torque [Nm]")
axes[0].legend()

axes[1].plot(eval_record.nodes["sensor"].steps.ts_end, eval_record.nodes["sensor"].steps.output.th, label="th")
axes[1].set_xlabel("Time [s]")
axes[1].set_ylabel("Angle [rad]")
axes[1].legend()

axes[2].plot(eval_record.nodes["sensor"].steps.ts_end, eval_record.nodes["sensor"].steps.output.thdot, label="thdot")
axes[2].set_xlabel("Time [s]")
axes[2].set_ylabel("Ang. Vel. [rad/s]")
axes[2].legend();

# @title Visualize the Rollout
# @markdown The following visualization shows the rollout of the pendulum swing-up task, displaying the system's behavior over time.
# @markdown Note: Html visualization may not work properly if rendering simultaneously in multiple cells.
# @markdown In such cases, comment-out all but one HTML(pendulum.render(rollout)).
HTML(pdm.render.render(eval_real_rollout, dt=float(1 / world.rate)))

Brax visualizer

<script type="module">
  import {Viewer} from 'viewer';
  const domElement = document.getElementById("brax-viewer");
  var viewer = new Viewer(domElement, system);
</script>