Building a Reinforcement Learning Agent that can Play Rocket League

Sohum Padhye
12 min read · Mar 25, 2023

In my last article, I showed how I created a Super Mario Bros. game-playing reinforcement learning agent. After doing that, I started looking at other, more complex and mechanically demanding games that I could train a reinforcement learning agent on. That’s when I found Rocket League.

However, at the end of my last article, I said I would look more into Unity’s ML-Agents package. At that time, I decided I wanted to build a robotic arm that could collect objects and place them in a target area, like an automated trash collector and sorter.

I was very excited to start this project, but as it went on it became clear that it could take me a very long time to complete. After about 2 months of work, I hadn’t even finished the environment setup, nor had I started to create the RL agent that would learn to collect the items. Unity and C# were very new to me, and it became clear that I would have to spend much more time learning them before I could complete the project.

Several problems arose that forced me to constantly change my code and the dynamics of the environment. Eventually I felt I was making no progress and decided to scrap the idea, since I simply didn’t know enough about Unity and C# to create a project that fully worked.

I realized that the main problem was the fact that I needed to make a custom environment and find a way to connect it to the agent.

So, I decided to steer away from Unity and look at pre-made environments that I could use to implement reinforcement learning. While browsing different websites, I saw that you could modify Rocket League (a soccer game, but with cars instead of humans) to make an environment in which a reinforcement learning agent could learn to play it. I instantly started looking more into it.

In this article, I’ll go over the following topics (if you want an overview of reinforcement learning and how it works, I highly suggest you check out my last article):

  • RLGym
  • The Reinforcement Learning Algorithm — Proximal Policy Optimization
  • Code Overview
  • Training
  • Code Demo (Video)
  • Final Thoughts
  • Extra Resources

RLGym

Rocket League is a simple game at its core: you maneuver a car to hit a ball into a goal. However, the stock game isn’t suited to training a game-playing agent efficiently (it doesn’t let you change the game speed, among other limitations). So the creators of RLGym (Rocket League Gym) built a way to turn Rocket League into a virtual environment suitable for training a reinforcement learning agent, one that can play through games much faster than the regular game speed.

RLGym was designed to be like OpenAI Gym, which you might know. All it does is provide an environment that you can instantiate; then, you can use various tools from the RLGym library to help with designing and tweaking your code for the reward systems, terminal conditions, and more (I will explain these more in the code overview).
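If you’ve used OpenAI Gym before, the interface will feel familiar. Here’s a minimal sketch of the basic loop using RLGym’s default settings (the real project configures the environment much more heavily, as you’ll see in the code overview):

import rlgym

# Launch a Rocket League instance with RLGym's default settings
env = rlgym.make()

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # random action, just for illustration
    obs, reward, done, info = env.step(action)  # standard Gym-style step
env.close()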

Using RLGym, a small team of programmers created and trained a bot named Nexto, which was able to beat players in the Grand Champion rank (top 1%).

However, this was not the only instance where reinforcement learning has been able to surpass humans.

A huge example of reinforcement learning defeating humans is chess. You may have heard of AlphaZero, DeepMind’s reinforcement-learning-based chess engine, which decisively defeated Stockfish, one of the strongest traditional engines, and plays far beyond any human. AlphaZero’s chess rating is estimated to be around 3600. To put that into perspective, the current world champion, Magnus Carlsen, has a rating of about 2850.

I’m getting off-topic. My point is that, given enough time, the newest reinforcement learning bots in Rocket League may be able to surpass professional players.

The Reinforcement Learning Algorithm — Proximal Policy Optimization

The algorithm I used for this project is proximal policy optimization (PPO), an easy-to-implement reinforcement learning algorithm created by OpenAI.

The “policy” in PPO is essentially the neural network’s mapping from inputs (observations) to outputs (actions); in practice, that means the network’s weights and biases.

PPO’s advantage over other reinforcement learning algorithms can be summarized in four parts:

  • Trust Region Optimization
  • Clipped Surrogate Objective Function
  • On-Policy Updates
  • Parallel Environments

Trust Region Optimization

PPO uses a technique called trust region optimization, which limits how much the neural network’s policy can change in a single update. This helps prevent unnecessary changes and unstable training. Without trust region optimization, the policy could update itself into a state worse than where it started, with no way of getting back.

You can think of it like trying to go to a destination in your car, but the roads that might take you there keep changing, and obstacles keep appearing — they force you to stray from the most direct route. While simply driving anywhere may eventually lead you to the destination, it’s dangerous and can instead result in you getting lost and not being able to find your way back. Instead, with PPO, you can define a boundary around the most direct route. This is a much safer alternative and ensures that you don’t stray too far away from your intended route and target position.

Clipped Surrogate Objective Function

The clipped surrogate objective function is the specific function in which trust region optimization is implemented. This ensures that whenever a change is made in the policy, it is not too different from the current policy.

There is a lot of math that goes on behind this, so I’ll try to give an intuitive definition with a little bit of math.

The clipped surrogate objective from the PPO paper is:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]

Essentially, to calculate the loss function, we want to take the minimum of two terms:

  • r_t(θ) * Â_t
  • clip(r_t(θ), 1-ε, 1+ε) * Â_t

Where:

  • θ represents the policy parameters of the neural network
  • r_t(θ) is the ratio of the probability of choosing an action with the new policy to the probability of choosing an action with the old policy.
  • Â_t is the advantage function, which tells us the extent to which taking a specific action will improve the agent’s performance.
  • ε (epsilon) is a hyperparameter which controls the size of the clipping region, or how big or small you want your changes in the neural network to be.

The first term is the unclipped objective, which allows large policy changes and can therefore make training unstable.

That’s where the second term comes in. The clip() function checks whether r_t(θ) falls within the range [1−ε, 1+ε]; if it doesn’t, it is clipped to the nearest boundary, which caps how much any single update can change the policy.
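To make this concrete, here’s a minimal sketch of the clipped loss in PyTorch. This isn’t the project’s code (Stable Baselines 3 implements it internally); it assumes you already have the log-probabilities of the chosen actions under the old and new policies, plus the advantage estimates:

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    # r_t(θ): ratio of new to old action probabilities
    ratios = torch.exp(new_log_probs - old_log_probs)
    # Unclipped term: r_t(θ) * Â_t
    unclipped = ratios * advantages
    # Clipped term: clip(r_t(θ), 1-ε, 1+ε) * Â_t
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    # Take the element-wise minimum, then negate (optimizers minimize, but we want to maximize)
    return -torch.min(unclipped, clipped).mean()

Taking the minimum of the two terms means the policy never gets extra credit for moving further than the clipping region allows.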

Thanks to the clipped surrogate objective, we can train our agent much more stably. However, this objective wouldn’t work without on-policy updates.

On-Policy Updates

Many reinforcement learning algorithms use off-policy updates: the network updates its current policy using data it collected while following older, different policies. On-policy updates, by contrast, only use data collected by the current policy.

Off-policy updates are more sample efficient than on-policy updates, meaning the network needs less data to reach a good policy. However, they can also be much less stable. PPO trades a bit of extra training time for policy updates that stay stable.

You’ll see just how much on-policy updates help with training in the next section.

Parallel Environments

Lastly, training time can be significantly reduced by running several game instances in parallel, gathering more data in less time. With PPO, these parallel environments are used to collect “mini-batches” of experience that are used to update the neural network’s policy.

Each of these mini-batches contains a set of something called transitions, which are tuples of (state, action, reward, next state) that the agent experiences during training.

Using parallel environments can help reduce the correlation between transitions in a mini-batch. Essentially, when just one environment is running, the transitions within a mini-batch are highly correlated as they are all the result of the same sequence of actions the agent takes.

However, when running the training in parallel environments, the transitions are less correlated. When we have data that is less correlated, it leads to more stable updates to the policy, and therefore, more stable training.
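As a rough illustration (not the project’s code; Stable Baselines 3 handles rollout collection for you inside model.learn()), collecting a mini-batch from a vectorized environment looks something like this. Transition and collect_minibatch are just names I’ve made up for the sketch:

from collections import namedtuple

# One transition: what the agent saw, what it did, and what happened next
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

def collect_minibatch(vec_env, model, n_steps):
    batch = []
    states = vec_env.reset()
    for _ in range(n_steps):
        actions, _ = model.predict(states)                          # one action per parallel game
        next_states, rewards, dones, infos = vec_env.step(actions)
        # Each step adds one transition per environment, so the batch
        # mixes experience from several independent games at once
        batch.extend(Transition(s, a, r, ns)
                     for s, a, r, ns in zip(states, actions, rewards, next_states))
        states = next_states
    return batch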

Code Overview

When I started following a tutorial for this project, I expected to have to write a lot of code. However, RLGym makes things a lot less complicated, since it comes with several helper functions and utilities.

In my GitHub version, I left comments on all the lines I could (except where the creator of the code I built on had already commented them), so if you don’t know what a specific line means, you can always check out my version of the code.

Now I’ll be going over some of the key parts of this code.

Imports/Libraries

import numpy as np
from rlgym.envs import Match
from rlgym.utils.action_parsers import DiscreteAction
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.vec_env import VecMonitor, VecNormalize, VecCheckNan
from stable_baselines3.ppo import MlpPolicy

from rlgym.utils.obs_builders import AdvancedObs
from rlgym.utils.state_setters import DefaultState
from rlgym.utils.terminal_conditions.common_conditions import TimeoutCondition, NoTouchTimeoutCondition, GoalScoredCondition
from rlgym_tools.sb3_utils import SB3MultipleInstanceEnv
from rlgym.utils.reward_functions.common_rewards.misc_rewards import EventReward
from rlgym.utils.reward_functions.common_rewards.player_ball_rewards import VelocityPlayerToBallReward
from rlgym.utils.reward_functions.common_rewards.ball_goal_rewards import VelocityBallToGoalReward
from rlgym.utils.reward_functions import CombinedReward

Here we’re importing NumPy, which is a package that helps a lot with scientific computing.

We also have a few imports from Stable Baselines 3, which contains a set of reinforcement learning algorithms and relevant helper functions.

Finally, we import a number of RLGym utilities, which I’ll walk through below.

First off, on line 2 we are importing Match, which is essentially an instance of a game.

For line 3 — there are two types of action parsing. One is continuous, and the other is discrete. If you don’t know what continuous and discrete actions are, I’ve given an explanation below.

For example, let’s say that you have a neural network that tells you how much to steer your car. It can range from -1 to 1, with -1 being all the way to the left, 0 being no steering, and 1 being all the way to the right. A continuous action space would allow you to pick any value between -1 and 1 (e.g. -0.543468, 0.75648). However, a discrete action space would only allow actions in increments, like -1, 0, 1 (which only allow turning all the way to the left, not turning, or turning all the way to the right). So continuous action spaces seem much better, right?

Well, not exactly. While a continuous action space can make your model more precise, it can also take much longer to train. In Rocket League this has a huge impact on training time, since several actions are naturally continuous (acceleration, braking, steering angle, and car rotation). However, if you’ve ever played Rocket League, you’ll know that discretizing these actions doesn’t hurt a player’s performance much. So, for this project, a discrete action space is the better choice.
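As a toy illustration of the two kinds of action space (using plain Gym spaces here, not RLGym’s actual DiscreteAction parser):

import numpy as np
from gym.spaces import Box, MultiDiscrete

# Continuous steering: any value in [-1, 1]
continuous_steer = Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
print(continuous_steer.sample())        # e.g. [0.75648]

# Discrete steering: three bins that map to full left, straight, full right
discrete_steer = MultiDiscrete([3])
steer_bin = discrete_steer.sample()[0]  # 0, 1, or 2
print(steer_bin - 1)                    # mapped to -1, 0, or 1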

From RLGym we also import terminal conditions, which essentially decide when to reset the environment (e.g. nobody has touched the ball for a while or a goal was scored).

Finally, we import several reward functions, which determine how much positive or negative reward the agent receives based on the actions it takes.
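For context, here’s roughly how these imported pieces get wired together in the full script. The specific timeout values, reward weights, and instance count below are placeholders, not necessarily the ones I used; check my GitHub version for the exact configuration:

frame_skip = 8                  # physics ticks per agent action
half_life_seconds = 5           # how quickly future rewards are discounted
fps = 120 / frame_skip
gamma = np.exp(np.log(0.5) / (fps * half_life_seconds))

def get_match():
    return Match(
        team_size=1,
        spawn_opponents=True,
        obs_builder=AdvancedObs(),
        state_setter=DefaultState(),
        action_parser=DiscreteAction(),
        terminal_conditions=[
            TimeoutCondition(round(fps * 300)),        # hard cap on episode length
            NoTouchTimeoutCondition(round(fps * 45)),  # reset if nobody touches the ball
            GoalScoredCondition(),                     # reset after a goal
        ],
        reward_function=CombinedReward(
            (
                VelocityPlayerToBallReward(),   # always on: moving toward the ball
                VelocityBallToGoalReward(),     # always on: ball moving toward the goal
                EventReward(team_goal=100.0, concede=-100.0, shot=5.0, save=30.0, demo=10.0),
            ),
            (0.1, 1.0, 1.0),                    # relative weights of the three rewards
        ),
    )

# Run several Rocket League instances in parallel and wrap them for Stable Baselines 3
env = SB3MultipleInstanceEnv(get_match, num_instances=2, wait_time=20)
env = VecCheckNan(env)       # catch NaN observations early
env = VecMonitor(env)        # log episode rewards and lengths
env = VecNormalize(env, norm_obs=False, gamma=gamma)  # normalize rewards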

Defining the Neural Network

try:
    model = PPO.load(
        "models/exit_save.zip",
        env,
        device="auto",
        custom_objects={"n_envs": env.num_envs},
    )
    print("Loaded previous exit save.")
except:
    print("No saved model found, creating new model.")
    from torch.nn import Tanh

    policy_kwargs = dict(
        activation_fn=Tanh,
        net_arch=[512, 512, dict(pi=[256, 256, 256], vf=[256, 256, 256])],
    )

    model = PPO(
        MlpPolicy,
        env,
        n_epochs=10,
        policy_kwargs=policy_kwargs,
        learning_rate=5e-5,
        ent_coef=0.01,
        vf_coef=1.,
        gamma=gamma,
        verbose=3,
        batch_size=batch_size,
        n_steps=steps,
        tensorboard_log="logs",
        device="auto",
    )

In this piece of code, we’re first checking if we already have a saved model. If we do, then we load it and continue training with that.

If we don’t, we first define a neural network architecture that uses the Tanh activation function: two shared layers of 512 units, followed by separate 256-256-256 heads for the policy (pi) and the value function (vf). Then we create a PPO model (a class from Stable Baselines 3), giving it a policy type, the environment, the number of epochs, the policy arguments, and so on.

Training Starts

try:
    mmr_model_target_count = model.num_timesteps + mmr_save_frequency
    while True:
        model.learn(training_interval, callback=callback, reset_num_timesteps=False)
        model.save("models/exit_save")
        if model.num_timesteps >= mmr_model_target_count:
            model.save(f"mmr_models/{model.num_timesteps}")
            mmr_model_target_count += mmr_save_frequency

except KeyboardInterrupt:
    print("Exiting training")

The call to model.learn() is where everything comes together and training actually runs. After each training interval we make an exit save, a copy of the model’s latest policy written to disk (this is the save the script tries to load at the start).

Once you’re done training the agent, press Ctrl+C in the terminal; the KeyboardInterrupt is caught and training exits cleanly.
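One last note on the code shown above: it references a few names (gamma, steps, batch_size, training_interval, mmr_save_frequency, callback) that are defined earlier in the full script. gamma comes from the environment setup sketched earlier; the rest look roughly like this (the numbers are placeholders, see the repo for my actual values):

steps = 4096                      # n_steps: rollout length collected per environment
batch_size = steps                # mini-batch size for each gradient update
training_interval = 25_000_000    # timesteps per call to model.learn()
mmr_save_frequency = 50_000_000   # how often to keep a permanent snapshot of the policy

# Save a checkpoint of the model at a fixed timestep interval during training
callback = CheckpointCallback(
    save_freq=round(5_000_000 / env.num_envs),  # save_freq is counted per environment
    save_path="policy",
    name_prefix="rl_model",
)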

Training

I ended up training this agent for 50 million steps. While that is a lot, remember that the best bots out there, like Ripple — a bot currently being trained by RLGym — have likely been training for much longer. I’ll split this section into two parts, where I tell you what my agent learned, then tell you what Ripple can do.

My Rocket League Agent

Here are some observations of what I’ve seen while viewing the training:

0–10M steps: The agent is moving around mindlessly, and not heading toward the ball yet.

20M-30M steps: The agent starts heading toward the ball, but then stops just before hitting it. I have my own theory about why this happens, and it all has to do with the reward:

In the combined reward (the total reward), the agent is rewarded continuously for:

  • Its velocity toward the ball
  • The ball’s velocity toward the goal

Then, it has some conditional rewards:

  • +100 if their team scores a goal
  • -100 if their team concedes a goal
  • +5 if they make a shot on target
  • +30 if they make a save
  • +10 if they make a demolition (when they destroy a car on the opposite team)

I think the agent drives extremely fast toward the ball but brakes before hitting it because it hasn’t learned to prefer a larger, delayed reward over a smaller, instant one.

The smaller, instant reward would refer to the car’s velocity toward the ball, and the larger, delayed reward would be hitting the ball into the goal.

At this stage, the agent understands that when it hits the ball, its reward for velocity toward the ball will be greatly reduced, as the agent and the ball will both be moving in the same direction. It also begins to understand when the environment resets due to it and the opponent not hitting the ball.

This way, it stops close to when the environment will reset so that it doesn’t get negative or 0 rewards for not moving toward the ball.

That’s my explanation for its behaviour, but if you have a different idea, feel free to reply to this post!

40M-50M steps: The agent starts going toward the ball and trying to hit it into the goal, although it sometimes still behaves the way it did from 20M-30M steps. It ends up scoring a lot more goals now.

Throughout the 50 million steps, one thing I noticed was that whenever it obtained some boost to use, it used it pretty much right away rather than conserving it for crucial moments in the game.

You may have noticed that I skipped some ranges (10M-20M, 30M-40M). I didn’t cover them because each one was a “gray” area that showed behaviours from both of its surrounding ranges.

Ripple

The reason I brought up my agent’s poor boost management earlier is that Ripple manages its boost very well and uses it at the right times. That just goes to show how much training time has gone into making this bot what it is now.

(If you want, you can go see Ripple being trained right now)

This bot is being trained in 1-vs-1, 2-vs-2, and 3-vs-3 scenarios (each scenario is decided randomly to give the bot experience with all scenarios).

From what I’ve seen so far, Ripple uses good positioning tactics while maintaining its momentum. It knows some mechanics, like flip cancelling, but it occasionally makes unnecessary jumps and flips, so it clearly still has room to improve. As of the time I’m writing this article, Ripple has been training for about 2 days and 6 hours, which amounts to roughly 340 days of in-game training.

Code Demo (Video)

Final Thoughts

Although having to abandon my original idea was a major setback, I still loved making this project and watching how the agent progressed as it trained.

Next, I want to make more reinforcement learning projects. I’ll update you when I do.

Until then, thank you for reading this article, and I’ll catch you in the next one!

Extra Resources

I highly suggest you check these resources out (especially the last two!). They helped make this article possible.
