---
license: openrail
pipeline_tag: reinforcement-learning
---
|
|
|
# EfficientZero Remastered
|
|
|
This repo contains pre-trained models for EfficientZero Remastered, a
Gigglebit Studios project to stabilize the training process for the
state-of-the-art EfficientZero model.
|
|
|
* [Training source code](https://github.com/steventrouble/EfficientZero)
* [About the project](https://www.gigglebit.net/blog/efficientzero.html)
* [About EfficientZero](https://arxiv.org/abs/2111.00210)
* [About Gigglebit](https://www.gigglebit.net/)
|
|
|
Huge thanks to [Stability AI](https://stability.ai/) for providing the compute
for this project!
|
|
|
---
|
|
|
## How to use these files
|
|
|
Download the model you want to evaluate, then run `test.py` against it.
|
|
|
_Note: We've only productionized the training process. If you want to use these
for inference in production, you'll need to write your own inference logic.
If you do, send us a PR and we'll add it to the repo!_
|
|
|
Files are labeled as follows:
|
|
|
```
{gym_env}-s{seed}-e{env_steps}-t{train_steps}
```
|
|
|
Where:

* `gym_env`: The string ID of the gym environment this model was trained on,
  e.g. `Breakout-v5`.
* `seed`: The seed that was used to train this model. Usually 0.
* `env_steps`: The total number of steps in the environment that this model
  observed, usually 100k.
* `train_steps`: The total number of training epochs the model underwent.
|
|
|
Note that `env_steps` can differ from `train_steps` because the model can
continue fine-tuning from its replay buffer. In the paper, the last 20k
epochs are done this way. This isn't necessary outside of benchmarks, and
in theory, better performance should be attainable by gathering more samples
from the environment.
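The naming scheme can be split apart mechanically. Below is a minimal sketch of a parser for it; the `parse_checkpoint_name` helper is our own illustration (it's not part of the repo), and the handling of a trailing `k` suffix (e.g. `100k`) is an assumption about how the step counts appear in filenames.

```python
import re

# Matches {gym_env}-s{seed}-e{env_steps}-t{train_steps}.
# gym_env itself may contain hyphens (e.g. "Breakout-v5"), so the regex
# relies on the -s/-e/-t markers to find the field boundaries.
FILENAME_RE = re.compile(
    r"^(?P<gym_env>.+)-s(?P<seed>\d+)-e(?P<env_steps>\d+k?)-t(?P<train_steps>\d+k?)$"
)

def parse_checkpoint_name(name: str) -> dict:
    """Split a checkpoint filename into its labeled fields."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    fields = match.groupdict()
    # Expand a trailing 'k' suffix (assumed shorthand: '100k' -> 100000).
    for key in ("env_steps", "train_steps"):
        value = fields[key]
        fields[key] = int(value[:-1]) * 1000 if value.endswith("k") else int(value)
    fields["seed"] = int(fields["seed"])
    return fields

# Prints the parsed fields for a hypothetical checkpoint name.
print(parse_checkpoint_name("Breakout-v5-s0-e100k-t120k"))
```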
|
|
|
---
|
|
|
## Findings
|
|
|
Our primary goal in this project was to test EfficientZero and assess its
capabilities. We were amazed by the model overall, especially on Breakout,
where it far outperformed the human baseline. The overall cost was only about
$50 per fully trained model, compared to the hundreds of thousands of dollars
needed to train MuZero.
|
|
|
Though the trained models achieved impressive scores in Atari, they didn't
reach the stellar scores demonstrated in the paper. This could be because we
used different hardware and dependencies, or because ML research papers tend
to cherry-pick models and environments that showcase good results.
|
|
|
Additionally, the models tended to hit a performance wall between 75k and 100k
steps. While we don't have enough data to know why or how often this happens,
it's not surprising: the model was tuned specifically for data efficiency and
hasn't been tested at larger scales. A model like MuZero might be more
appropriate if you have a large budget.
|
|
|
Training times were also longer than those reported in the EfficientZero
paper. The paper states that a model can be trained to completion in 7 hours;
in practice, we found that it takes an A100 with 32 cores one to two days.
This is likely because the training process is more CPU-intensive than other
models' and therefore performs poorly on the low-frequency, many-core CPUs
found in GPU clusters.