---
license: openrail
pipeline_tag: reinforcement-learning
---

# EfficientZero Remastered

This repo contains the pre-trained models for EfficientZero Remastered, a
Gigglebit Studios project to stabilize the training process for the
state-of-the-art EfficientZero model.

* [Training source code](https://github.com/steventrouble/EfficientZero)
* [About the project](https://www.gigglebit.net/blog/efficientzero.html)
* [About EfficientZero](https://arxiv.org/abs/2111.00210)
* [About Gigglebit](https://www.gigglebit.net/)

Huge thanks to [Stability AI](https://stability.ai/) for providing the compute
for this project!

---

## How to use these files

Download the model that you want to evaluate, then run `test.py` to test it.

_Note: We've only productionized the training process. If you want to use these
for inference in production, you'll need to write your own inference logic.
If you do, send us a PR and we'll add it to the repo!_
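
For instance, here is a minimal sketch of fetching a checkpoint with the
`huggingface_hub` library. The repo ID and filename below are placeholders,
not this repo's actual values; the filename just follows the naming scheme
described below.

```python
from huggingface_hub import hf_hub_download

# Placeholder values -- substitute this repo's actual ID and the checkpoint
# you want to evaluate (the real files may also carry a file extension).
checkpoint_path = hf_hub_download(
    repo_id="gigglebit/efficientzero-remastered",  # hypothetical repo ID
    filename="Breakout-v5-s0-e100000-t120000",     # hypothetical checkpoint name
)

# Pass the downloaded path to test.py as described in the training repo.
print(checkpoint_path)
```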

Files are labeled as follows:

```
{gym_env}-s{seed}-e{env_steps}-t{train_steps}
```

Where:
*   `gym_env`: The string ID of the gym environment the model was trained on,
    e.g. `Breakout-v5`.
*   `seed`: The seed used to train the model (usually 0).
*   `env_steps`: The total number of environment steps the model observed
    (usually 100k).
*   `train_steps`: The total number of training epochs the model underwent.

Note that `env_steps` can differ from `train_steps` because the model can
continue fine-tuning from its replay buffer. In the paper, the last 20k
epochs are trained this way. This isn't necessary outside of benchmarks,
and in theory you can get better performance by collecting more samples
from the environment.
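
As a quick illustration (this helper isn't part of the official tooling),
the fields can be recovered from a checkpoint name like so, assuming the
numeric fields are written as plain integers:

```python
import re

# Illustrative helper, not part of the official repo: split a checkpoint
# name such as "Breakout-v5-s0-e100000-t120000" into its labeled fields.
# Adjust the pattern if the actual files use suffixes like "100k".
NAME_PATTERN = re.compile(
    r"^(?P<gym_env>.+)-s(?P<seed>\d+)-e(?P<env_steps>\d+)-t(?P<train_steps>\d+)$"
)

def parse_checkpoint_name(name: str) -> dict:
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint name: {name!r}")
    fields = match.groupdict()
    return {
        "gym_env": fields["gym_env"],
        "seed": int(fields["seed"]),
        "env_steps": int(fields["env_steps"]),
        "train_steps": int(fields["train_steps"]),
    }

print(parse_checkpoint_name("Breakout-v5-s0-e100000-t120000"))
# -> {'gym_env': 'Breakout-v5', 'seed': 0, 'env_steps': 100000, 'train_steps': 120000}
```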

---

## Findings

Our primary goal in this project was to evaluate EfficientZero and see what it's capable of.
We were amazed by the model overall, especially on Breakout, where it far outperformed
the human baseline. The overall cost was only about $50 per fully trained model, compared
to the hundreds of thousands of dollars needed to train MuZero.

Though the trained models achieved impressive scores on Atari, they didn't reach the
stellar scores reported in the paper. This could be because we used different hardware
and dependencies, or because ML research papers tend to cherry-pick models and
environments that showcase good results.

Additionally, the models tended to hit a performance wall between 75k and 100k steps. While we
don't have enough data to know why or how often this happens, it's not surprising: the model
was tuned specifically for data efficiency, so it hasn't been tested at larger scales. A
model like MuZero might be more appropriate if you have a large budget.

Training times were longer than those reported in the EfficientZero paper. The paper
states that a model can be trained to completion in 7 hours, while in practice we found
it takes 1 to 2 days on an A100 with 32 CPU cores. This is likely because the training
process is more CPU-intensive than most models and therefore performs poorly on the
low-frequency, many-core CPUs found in GPU clusters.