Gyaneshere committed on
Commit 0e4ebca · verified · 1 Parent(s): 931d76c

Upload PPO.ipynb


Proximal Policy Optimization Notebook

Files changed (1)
  1. PPO.ipynb +1202 -0
PPO.ipynb ADDED
@@ -0,0 +1,1202 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "njb_ProuHiOe"
7
+ },
8
+ "source": [
9
+ "# Unit 1: Train your first Deep Reinforcement Learning Agent 🤖\n",
10
+ "\n",
11
+ "![Cover](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/thumbnail.jpg)\n",
12
+ "\n",
13
+ "In this notebook, you'll train your **first Deep Reinforcement Learning agent** a Lunar Lander agent that will learn to **land correctly on the Moon 🌕**. Using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) a Deep Reinforcement Learning library, share them with the community, and experiment with different configurations\n",
14
+ "\n",
15
+ "⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️\n",
16
+ "\n",
17
+ "\n"
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "code",
22
+ "execution_count": null,
23
+ "metadata": {
24
+ "id": "PF46MwbZD00b"
25
+ },
26
+ "outputs": [],
27
+ "source": [
28
+ "%%html\n",
29
+ "<video controls autoplay><source src=\"https://huggingface.co/sb3/ppo-LunarLander-v2/resolve/main/replay.mp4\" type=\"video/mp4\"></video>"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "markdown",
34
+ "metadata": {
35
+ "id": "x7oR6R-ZIbeS"
36
+ },
37
+ "source": [
38
+ "### The environment 🎮\n",
39
+ "\n",
40
+ "- [LunarLander-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/)\n",
41
+ "\n",
42
+ "### The library used 📚\n",
43
+ "\n",
44
+ "- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "markdown",
49
+ "metadata": {
50
+ "id": "OwEcFHe9RRZW"
51
+ },
52
+ "source": [
53
+ "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "markdown",
58
+ "metadata": {
59
+ "id": "4i6tjI2tHQ8j"
60
+ },
61
+ "source": [
62
+ "## Objectives of this notebook 🏆\n",
63
+ "\n",
64
+ "At the end of the notebook, you will:\n",
65
+ "\n",
66
+ "- Be able to use **Gymnasium**, the environment library.\n",
67
+ "- Be able to use **Stable-Baselines3**, the deep reinforcement learning library.\n",
68
+ "- Be able to **push your trained agent to the Hub** with a nice video replay and an evaluation score 🔥.\n",
69
+ "\n",
70
+ "\n"
71
+ ]
72
+ },
73
+ {
74
+ "cell_type": "markdown",
75
+ "metadata": {
76
+ "id": "Ff-nyJdzJPND"
77
+ },
78
+ "source": [
79
+ "## This notebook is from Deep Reinforcement Learning Course\n",
80
+ "\n",
81
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
82
+ ]
83
+ },
84
+ {
85
+ "cell_type": "markdown",
86
+ "metadata": {
87
+ "id": "6p5HnEefISCB"
88
+ },
89
+ "source": [
90
+ "In this free course, you will:\n",
91
+ "\n",
92
+ "- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
93
+ "- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
94
+ "- 🤖 Train **agents in unique environments**\n",
95
+ "- 🎓 **Earn a certificate of completion** by completing 80% of the assignments.\n",
96
+ "\n",
97
+ "And more!\n",
98
+ "\n",
99
+ "Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course\n",
100
+ "\n",
101
+ "Don’t forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
102
+ "\n",
103
+ "The best way to keep in touch and ask questions is **to join our discord server** to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "markdown",
108
+ "metadata": {
109
+ "id": "Y-mo_6rXIjRi"
110
+ },
111
+ "source": [
112
+ "## Prerequisites 🏗️\n",
113
+ "\n",
114
+ "Before diving into the notebook, you need to:\n",
115
+ "\n",
116
+ "🔲 📝 **[Read Unit 0](https://huggingface.co/deep-rl-course/unit0/introduction)** that gives you all the **information about the course and helps you to onboard** 🤗\n",
117
+ "\n",
118
+ "🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (RL process, Rewards hypothesis...) by [reading Unit 1](https://huggingface.co/deep-rl-course/unit1/introduction)."
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "markdown",
123
+ "metadata": {
124
+ "id": "HoeqMnr5LuYE"
125
+ },
126
+ "source": [
127
+ "## A small recap of Deep Reinforcement Learning 📚\n",
128
+ "\n",
129
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg\" alt=\"The RL process\" width=\"100%\">"
130
+ ]
131
+ },
132
+ {
133
+ "cell_type": "markdown",
134
+ "metadata": {
135
+ "id": "xcQYx9ynaFMD"
136
+ },
137
+ "source": [
138
+ "Let's do a small recap on what we learned in the first Unit:\n",
139
+ "\n",
140
+ "- Reinforcement Learning is a **computational approach to learning from actions**. We build an agent that learns from the environment by **interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.\n",
141
+ "\n",
142
+ "- The goal of any RL agent is to **maximize its expected cumulative reward** (also called expected return) because RL is based on the _reward hypothesis_, which is that all goals can be described as the maximization of an expected cumulative reward.\n",
143
+ "\n",
144
+ "- The RL process is a **loop that outputs a sequence of state, action, reward, and next state**.\n",
145
+ "\n",
146
+ "- To calculate the expected cumulative reward (expected return), **we discount the rewards**: the rewards that come sooner (at the beginning of the game) are more probable to happen since they are more predictable than the long-term future reward.\n",
147
+ "\n",
148
+ "- To solve an RL problem, you want to **find an optimal policy**; the policy is the \"brain\" of your AI that will tell us what action to take given a state. The optimal one is the one that gives you the actions that max the expected return.\n",
149
+ "\n",
150
+ "There are **two** ways to find your optimal policy:\n",
151
+ "\n",
152
+ "- By **training your policy directly**: policy-based methods.\n",
153
+ "- By **training a value function** that tells us the expected return the agent will get at each state and use this function to define our policy: value-based methods.\n",
154
+ "\n",
155
+ "- Finally, we spoke about Deep RL because **we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based) hence the name \"deep.\"**"
156
+ ]
157
+ },
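+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the discounting idea concrete, here is a minimal sketch (our own addition, not part of the original course code) that computes a discounted return for a toy list of rewards. The rewards and the gamma value are made up purely for illustration.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Minimal sketch: computing a discounted return for a toy episode.\n",
+ "# The per-step rewards and the discount factor gamma are illustrative values.\n",
+ "rewards = [1.0, 1.0, 1.0, -100.0] # hypothetical rewards, e.g. ending in a crash\n",
+ "gamma = 0.99 # discount factor: rewards further in the future count less\n",
+ "\n",
+ "discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))\n",
+ "print(f\"Discounted return: {discounted_return:.2f}\")"
+ ]
+ },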
158
+ {
159
+ "cell_type": "markdown",
160
+ "metadata": {
161
+ "id": "qDploC3jSH99"
162
+ },
163
+ "source": [
164
+ "# Let's train our first Deep Reinforcement Learning agent and upload it to the Hub 🚀\n",
165
+ "\n",
166
+ "## Get a certificate 🎓\n",
167
+ "\n",
168
+ "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.\n",
169
+ "\n",
170
+ "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
171
+ "\n",
172
+ "For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
173
+ ]
174
+ },
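+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick sanity check, you can compute the leaderboard result yourself; the sketch below uses made-up evaluation numbers purely for illustration.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: how the leaderboard score is computed (illustrative numbers).\n",
+ "mean_reward, std_reward = 250.5, 30.2 # hypothetical evaluation results\n",
+ "result = mean_reward - std_reward # the leaderboard uses mean minus std\n",
+ "print(f\"result = {result:.2f} (needs to be >= 200 for the certificate)\")"
+ ]
+ },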
175
+ {
176
+ "cell_type": "markdown",
177
+ "metadata": {
178
+ "id": "HqzznTzhNfAC"
179
+ },
180
+ "source": [
181
+ "## Set the GPU 💪\n",
182
+ "\n",
183
+ "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
184
+ "\n",
185
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "markdown",
190
+ "metadata": {
191
+ "id": "38HBd3t1SHJ8"
192
+ },
193
+ "source": [
194
+ "- `Hardware Accelerator > GPU`\n",
195
+ "\n",
196
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
197
+ ]
198
+ },
199
+ {
200
+ "cell_type": "markdown",
201
+ "metadata": {
202
+ "id": "jeDAH0h0EBiG"
203
+ },
204
+ "source": [
205
+ "## Install dependencies and create a virtual screen 🔽\n",
206
+ "\n",
207
+ "The first step is to install the dependencies, we’ll install multiple ones.\n",
208
+ "\n",
209
+ "- `gymnasium[box2d]`: Contains the LunarLander-v2 environment 🌛\n",
210
+ "- `stable-baselines3[extra]`: The deep reinforcement learning library.\n",
211
+ "- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.\n",
212
+ "\n",
213
+ "To make things easier, we created a script to install all these dependencies."
214
+ ]
215
+ },
216
+ {
217
+ "cell_type": "code",
218
+ "execution_count": null,
219
+ "metadata": {
220
+ "id": "yQIGLPDkGhgG"
221
+ },
222
+ "outputs": [],
223
+ "source": [
224
+ "!apt install swig cmake"
225
+ ]
226
+ },
227
+ {
228
+ "cell_type": "code",
229
+ "execution_count": null,
230
+ "metadata": {
231
+ "id": "9XaULfDZDvrC"
232
+ },
233
+ "outputs": [],
234
+ "source": [
235
+ "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt"
236
+ ]
237
+ },
238
+ {
239
+ "cell_type": "markdown",
240
+ "metadata": {
241
+ "id": "BEKeXQJsQCYm"
242
+ },
243
+ "source": [
244
+ "During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).\n",
245
+ "\n",
246
+ "Hence the following cell will install virtual screen libraries and create and run a virtual screen 🖥"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "code",
251
+ "execution_count": null,
252
+ "metadata": {
253
+ "id": "j5f2cGkdP-mb"
254
+ },
255
+ "outputs": [],
256
+ "source": [
257
+ "!sudo apt-get update\n",
258
+ "!sudo apt-get install -y python3-opengl\n",
259
+ "!apt install ffmpeg\n",
260
+ "!apt install xvfb\n",
261
+ "!pip3 install pyvirtualdisplay"
262
+ ]
263
+ },
264
+ {
265
+ "cell_type": "markdown",
266
+ "metadata": {
267
+ "id": "TCwBTAwAW9JJ"
268
+ },
269
+ "source": [
270
+ "To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"
271
+ ]
272
+ },
273
+ {
274
+ "cell_type": "code",
275
+ "execution_count": null,
276
+ "metadata": {
277
+ "id": "cYvkbef7XEMi"
278
+ },
279
+ "outputs": [],
280
+ "source": [
281
+ "import os\n",
282
+ "os.kill(os.getpid(), 9)"
283
+ ]
284
+ },
285
+ {
286
+ "cell_type": "code",
287
+ "execution_count": null,
288
+ "metadata": {
289
+ "id": "BE5JWP5rQIKf"
290
+ },
291
+ "outputs": [],
292
+ "source": [
293
+ "# Virtual display\n",
294
+ "from pyvirtualdisplay import Display\n",
295
+ "\n",
296
+ "virtual_display = Display(visible=0, size=(1400, 900))\n",
297
+ "virtual_display.start()"
298
+ ]
299
+ },
300
+ {
301
+ "cell_type": "markdown",
302
+ "metadata": {
303
+ "id": "wrgpVFqyENVf"
304
+ },
305
+ "source": [
306
+ "## Import the packages 📦\n",
307
+ "\n",
308
+ "One additional library we import is huggingface_hub **to be able to upload and download trained models from the hub**.\n",
309
+ "\n",
310
+ "\n",
311
+ "The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
312
+ "\n",
313
+ "You can see here all the Deep reinforcement Learning models available here👉 https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads\n",
314
+ "\n"
315
+ ]
316
+ },
317
+ {
318
+ "cell_type": "code",
319
+ "execution_count": null,
320
+ "metadata": {
321
+ "id": "cygWLPGsEQ0m"
322
+ },
323
+ "outputs": [],
324
+ "source": [
325
+ "import gymnasium\n",
326
+ "\n",
327
+ "from huggingface_sb3 import load_from_hub, package_to_hub\n",
328
+ "from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.\n",
329
+ "\n",
330
+ "from stable_baselines3 import PPO\n",
331
+ "from stable_baselines3.common.env_util import make_vec_env\n",
332
+ "from stable_baselines3.common.evaluation import evaluate_policy\n",
333
+ "from stable_baselines3.common.monitor import Monitor"
334
+ ]
335
+ },
336
+ {
337
+ "cell_type": "markdown",
338
+ "metadata": {
339
+ "id": "MRqRuRUl8CsB"
340
+ },
341
+ "source": [
342
+ "## Understand Gymnasium and how it works 🤖\n",
343
+ "\n",
344
+ "🏋 The library containing our environment is called Gymnasium.\n",
345
+ "**You'll use Gymnasium a lot in Deep Reinforcement Learning.**\n",
346
+ "\n",
347
+ "Gymnasium is the **new version of Gym library** [maintained by the Farama Foundation](https://farama.org/).\n",
348
+ "\n",
349
+ "The Gymnasium library provides two things:\n",
350
+ "\n",
351
+ "- An interface that allows you to **create RL environments**.\n",
352
+ "- A **collection of environments** (gym-control, atari, box2D...).\n",
353
+ "\n",
354
+ "Let's look at an example, but first let's recall the RL loop.\n",
355
+ "\n",
356
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg\" alt=\"The RL process\" width=\"100%\">"
357
+ ]
358
+ },
359
+ {
360
+ "cell_type": "markdown",
361
+ "metadata": {
362
+ "id": "-TzNN0bQ_j-3"
363
+ },
364
+ "source": [
365
+ "At each step:\n",
366
+ "- Our Agent receives a **state (S0)** from the **Environment** — we receive the first frame of our game (Environment).\n",
367
+ "- Based on that **state (S0),** the Agent takes an **action (A0)** — our Agent will move to the right.\n",
368
+ "- The environment transitions to a **new** **state (S1)** — new frame.\n",
369
+ "- The environment gives some **reward (R1)** to the Agent — we’re not dead *(Positive Reward +1)*.\n",
370
+ "\n",
371
+ "\n",
372
+ "With Gymnasium:\n",
373
+ "\n",
374
+ "1️⃣ We create our environment using `gymnasium.make()`\n",
375
+ "\n",
376
+ "2️⃣ We reset the environment to its initial state with `observation = env.reset()`\n",
377
+ "\n",
378
+ "At each step:\n",
379
+ "\n",
380
+ "3️⃣ Get an action using our model (in our example we take a random action)\n",
381
+ "\n",
382
+ "4️⃣ Using `env.step(action)`, we perform this action in the environment and get\n",
383
+ "- `observation`: The new state (st+1)\n",
384
+ "- `reward`: The reward we get after executing the action\n",
385
+ "- `terminated`: Indicates if the episode terminated (agent reach the terminal state)\n",
386
+ "- `truncated`: Introduced with this new version, it indicates a timelimit or if an agent go out of bounds of the environment for instance.\n",
387
+ "- `info`: A dictionary that provides additional information (depends on the environment).\n",
388
+ "\n",
389
+ "For more explanations check this 👉 https://gymnasium.farama.org/api/env/#gymnasium.Env.step\n",
390
+ "\n",
391
+ "If the episode is terminated:\n",
392
+ "- We reset the environment to its initial state with `observation = env.reset()`\n",
393
+ "\n",
394
+ "**Let's look at an example!** Make sure to read the code\n"
395
+ ]
396
+ },
397
+ {
398
+ "cell_type": "code",
399
+ "execution_count": null,
400
+ "metadata": {
401
+ "id": "w7vOFlpA_ONz"
402
+ },
403
+ "outputs": [],
404
+ "source": [
405
+ "import gymnasium as gym\n",
406
+ "\n",
407
+ "# First, we create our environment called LunarLander-v2\n",
408
+ "env = gym.make(\"LunarLander-v2\")\n",
409
+ "\n",
410
+ "# Then we reset this environment\n",
411
+ "observation, info = env.reset()\n",
412
+ "\n",
413
+ "for _ in range(20):\n",
414
+ " # Take a random action\n",
415
+ " action = env.action_space.sample()\n",
416
+ " print(\"Action taken:\", action)\n",
417
+ "\n",
418
+ " # Do this action in the environment and get\n",
419
+ " # next_state, reward, terminated, truncated and info\n",
420
+ " observation, reward, terminated, truncated, info = env.step(action)\n",
421
+ "\n",
422
+ " # If the game is terminated (in our case we land, crashed) or truncated (timeout)\n",
423
+ " if terminated or truncated:\n",
424
+ " # Reset the environment\n",
425
+ " print(\"Environment is reset\")\n",
426
+ " observation, info = env.reset()\n",
427
+ "\n",
428
+ "env.close()"
429
+ ]
430
+ },
431
+ {
432
+ "cell_type": "markdown",
433
+ "metadata": {
434
+ "id": "XIrKGGSlENZB"
435
+ },
436
+ "source": [
437
+ "## Create the LunarLander environment 🌛 and understand how it works\n",
438
+ "\n",
439
+ "### [The environment 🎮](https://gymnasium.farama.org/environments/box2d/lunar_lander/)\n",
440
+ "\n",
441
+ "In this first tutorial, we’re going to train our agent, a [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/), **to land correctly on the moon**. To do that, the agent needs to learn **to adapt its speed and position (horizontal, vertical, and angular) to land correctly.**\n",
442
+ "\n",
443
+ "---\n",
444
+ "\n",
445
+ "\n",
446
+ "💡 A good habit when you start to use an environment is to check its documentation\n",
447
+ "\n",
448
+ "👉 https://gymnasium.farama.org/environments/box2d/lunar_lander/\n",
449
+ "\n",
450
+ "---\n"
451
+ ]
452
+ },
453
+ {
454
+ "cell_type": "markdown",
455
+ "metadata": {
456
+ "id": "poLBgRocF9aT"
457
+ },
458
+ "source": [
459
+ "Let's see what the Environment looks like:\n"
460
+ ]
461
+ },
462
+ {
463
+ "cell_type": "code",
464
+ "execution_count": null,
465
+ "metadata": {
466
+ "id": "ZNPG0g_UGCfh"
467
+ },
468
+ "outputs": [],
469
+ "source": [
470
+ "# We create our environment with gym.make(\"<name_of_the_environment>\")\n",
471
+ "env = gym.make(\"LunarLander-v2\")\n",
472
+ "env.reset()\n",
473
+ "print(\"_____OBSERVATION SPACE_____ \\n\")\n",
474
+ "print(\"Observation Space Shape\", env.observation_space.shape)\n",
475
+ "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
476
+ ]
477
+ },
478
+ {
479
+ "cell_type": "markdown",
480
+ "metadata": {
481
+ "id": "2MXc15qFE0M9"
482
+ },
483
+ "source": [
484
+ "We see with `Observation Space Shape (8,)` that the observation is a vector of size 8, where each value contains different information about the lander:\n",
485
+ "- Horizontal pad coordinate (x)\n",
486
+ "- Vertical pad coordinate (y)\n",
487
+ "- Horizontal speed (x)\n",
488
+ "- Vertical speed (y)\n",
489
+ "- Angle\n",
490
+ "- Angular speed\n",
491
+ "- If the left leg contact point has touched the land (boolean)\n",
492
+ "- If the right leg contact point has touched the land (boolean)\n"
493
+ ]
494
+ },
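+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To connect these eight values to code, here is a small sketch (our own addition, not from the course) that unpacks a single observation; the variable names are just our labels for the components listed above.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: unpacking one LunarLander-v2 observation into named components.\n",
+ "# The names below are our own labels for the 8 values described above.\n",
+ "obs, info = env.reset()\n",
+ "(x, y, vx, vy, angle, angular_speed, left_leg, right_leg) = obs\n",
+ "print(f\"position=({x:.2f}, {y:.2f}) speed=({vx:.2f}, {vy:.2f})\")\n",
+ "print(f\"angle={angle:.2f} angular_speed={angular_speed:.2f}\")\n",
+ "print(f\"legs touching: left={bool(left_leg)} right={bool(right_leg)}\")"
+ ]
+ },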
495
+ {
496
+ "cell_type": "code",
497
+ "execution_count": null,
498
+ "metadata": {
499
+ "id": "We5WqOBGLoSm"
500
+ },
501
+ "outputs": [],
502
+ "source": [
503
+ "print(\"\\n _____ACTION SPACE_____ \\n\")\n",
504
+ "print(\"Action Space Shape\", env.action_space.n)\n",
505
+ "print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
506
+ ]
507
+ },
508
+ {
509
+ "cell_type": "markdown",
510
+ "metadata": {
511
+ "id": "MyxXwkI2Magx"
512
+ },
513
+ "source": [
514
+ "The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:\n",
515
+ "\n",
516
+ "- Action 0: Do nothing,\n",
517
+ "- Action 1: Fire left orientation engine,\n",
518
+ "- Action 2: Fire the main engine,\n",
519
+ "- Action 3: Fire right orientation engine.\n",
520
+ "\n",
521
+ "Reward function (the function that will give a reward at each timestep) 💰:\n",
522
+ "\n",
523
+ "After every step a reward is granted. The total reward of an episode is the **sum of the rewards for all the steps within that episode**.\n",
524
+ "\n",
525
+ "For each step, the reward:\n",
526
+ "\n",
527
+ "- Is increased/decreased the closer/further the lander is to the landing pad.\n",
528
+ "- Is increased/decreased the slower/faster the lander is moving.\n",
529
+ "- Is decreased the more the lander is tilted (angle not horizontal).\n",
530
+ "- Is increased by 10 points for each leg that is in contact with the ground.\n",
531
+ "- Is decreased by 0.03 points each frame a side engine is firing.\n",
532
+ "- Is decreased by 0.3 points each frame the main engine is firing.\n",
533
+ "\n",
534
+ "The episode receive an **additional reward of -100 or +100 points for crashing or landing safely respectively.**\n",
535
+ "\n",
536
+ "An episode is **considered a solution if it scores at least 200 points.**"
537
+ ]
538
+ },
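+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a tiny illustration (our own addition, not from the course), you can map each discrete action index to a human-readable name when sampling actions:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: human-readable labels for the 4 discrete actions (our own mapping).\n",
+ "ACTION_NAMES = {\n",
+ " 0: \"do nothing\",\n",
+ " 1: \"fire left orientation engine\",\n",
+ " 2: \"fire main engine\",\n",
+ " 3: \"fire right orientation engine\",\n",
+ "}\n",
+ "action = env.action_space.sample()\n",
+ "print(f\"Action {action}: {ACTION_NAMES[action]}\")"
+ ]
+ },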
539
+ {
540
+ "cell_type": "markdown",
541
+ "metadata": {
542
+ "id": "dFD9RAFjG8aq"
543
+ },
544
+ "source": [
545
+ "#### Vectorized Environment\n",
546
+ "\n",
547
+ "- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments, this way, **we'll have more diverse experiences during the training.**"
548
+ ]
549
+ },
550
+ {
551
+ "cell_type": "code",
552
+ "execution_count": null,
553
+ "metadata": {
554
+ "id": "99hqQ_etEy1N"
555
+ },
556
+ "outputs": [],
557
+ "source": [
558
+ "# Create the environment\n",
559
+ "env = make_vec_env('LunarLander-v2', n_envs=16)"
560
+ ]
561
+ },
562
+ {
563
+ "cell_type": "markdown",
564
+ "metadata": {
565
+ "id": "VgrE86r5E5IK"
566
+ },
567
+ "source": [
568
+ "## Create the Model 🤖\n",
569
+ "- We have studied our environment and we understood the problem: **being able to land the Lunar Lander to the Landing Pad correctly by controlling left, right and main orientation engine**. Now let's build the algorithm we're going to use to solve this Problem 🚀.\n",
570
+ "\n",
571
+ "- To do so, we're going to use our first Deep RL library, [Stable Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/).\n",
572
+ "\n",
573
+ "- SB3 is a set of **reliable implementations of reinforcement learning algorithms in PyTorch**.\n",
574
+ "\n",
575
+ "---\n",
576
+ "\n",
577
+ "💡 A good habit when using a new library is to dive first on the documentation: https://stable-baselines3.readthedocs.io/en/master/ and then try some tutorials.\n",
578
+ "\n",
579
+ "----"
580
+ ]
581
+ },
582
+ {
583
+ "cell_type": "markdown",
584
+ "metadata": {
585
+ "id": "HLlClRW37Q7e"
586
+ },
587
+ "source": [
588
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sb3.png\" alt=\"Stable Baselines3\">"
589
+ ]
590
+ },
591
+ {
592
+ "cell_type": "markdown",
593
+ "metadata": {
594
+ "id": "HV4yiUM_9_Ka"
595
+ },
596
+ "source": [
597
+ "To solve this problem, we're going to use SB3 **PPO**. [PPO (aka Proximal Policy Optimization) is one of the SOTA (state of the art) Deep Reinforcement Learning algorithms that you'll study during this course](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example%5D).\n",
598
+ "\n",
599
+ "PPO is a combination of:\n",
600
+ "- *Value-based reinforcement learning method*: learning an action-value function that will tell us the **most valuable action to take given a state and action**.\n",
601
+ "- *Policy-based reinforcement learning method*: learning a policy that will **give us a probability distribution over actions**."
602
+ ]
603
+ },
604
+ {
605
+ "cell_type": "markdown",
606
+ "metadata": {
607
+ "id": "5qL_4HeIOrEJ"
608
+ },
609
+ "source": [
610
+ "Stable-Baselines3 is easy to set up:\n",
611
+ "\n",
612
+ "1️⃣ You **create your environment** (in our case it was done above)\n",
613
+ "\n",
614
+ "2️⃣ You define the **model you want to use and instantiate this model** `model = PPO(\"MlpPolicy\")`\n",
615
+ "\n",
616
+ "3️⃣ You **train the agent** with `model.learn` and define the number of training timesteps\n",
617
+ "\n",
618
+ "```\n",
619
+ "# Create environment\n",
620
+ "env = gym.make('LunarLander-v2')\n",
621
+ "\n",
622
+ "# Instantiate the agent\n",
623
+ "model = PPO('MlpPolicy', env, verbose=1)\n",
624
+ "# Train the agent\n",
625
+ "model.learn(total_timesteps=int(2e5))\n",
626
+ "```\n",
627
+ "\n"
628
+ ]
629
+ },
630
+ {
631
+ "cell_type": "code",
632
+ "execution_count": null,
633
+ "metadata": {
634
+ "id": "nxI6hT1GE4-A"
635
+ },
636
+ "outputs": [],
637
+ "source": [
638
+ "# TODO: Define a PPO MlpPolicy architecture\n",
639
+ "# We use MultiLayerPerceptron (MLPPolicy) because the input is a vector,\n",
640
+ "# if we had frames as input we would use CnnPolicy\n",
641
+ "# Create environment\n",
642
+ "env = gym.make('LunarLander-v2')\n",
643
+ "\n",
644
+ "# Instantiate the agent\n",
645
+ "model = PPO(\n",
646
+ " policy = 'MlpPolicy',\n",
647
+ " env = env,\n",
648
+ " n_steps = 1024,\n",
649
+ " batch_size = 64,\n",
650
+ " n_epochs = 4,\n",
651
+ " gamma = 0.999,\n",
652
+ " gae_lambda = 0.98,\n",
653
+ " ent_coef = 0.01,\n",
654
+ " verbose=1)\n",
655
+ "# Train the agent\n",
656
+ "model.learn(total_timesteps=int(2e5))"
657
+ ]
658
+ },
659
+ {
660
+ "cell_type": "markdown",
661
+ "metadata": {
662
+ "id": "QAN7B0_HCVZC"
663
+ },
664
+ "source": [
665
+ "#### Solution"
666
+ ]
667
+ },
668
+ {
669
+ "cell_type": "code",
670
+ "execution_count": null,
671
+ "metadata": {
672
+ "id": "543OHYDfcjK4"
673
+ },
674
+ "outputs": [],
675
+ "source": [
676
+ "# SOLUTION\n",
677
+ "# We added some parameters to accelerate the training\n",
678
+ "model = PPO(\n",
679
+ " policy = 'MlpPolicy',\n",
680
+ " env = env,\n",
681
+ " n_steps = 1024,\n",
682
+ " batch_size = 64,\n",
683
+ " n_epochs = 4,\n",
684
+ " gamma = 0.999,\n",
685
+ " gae_lambda = 0.98,\n",
686
+ " ent_coef = 0.01,\n",
687
+ " verbose=1)"
688
+ ]
689
+ },
690
+ {
691
+ "cell_type": "markdown",
692
+ "metadata": {
693
+ "id": "ClJJk88yoBUi"
694
+ },
695
+ "source": [
696
+ "## Train the PPO agent 🏃\n",
697
+ "- Let's train our agent for 1,000,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~20min, but you can use fewer timesteps if you just want to try it out.\n",
698
+ "- During the training, take a ☕ break you deserved it 🤗"
699
+ ]
700
+ },
701
+ {
702
+ "cell_type": "code",
703
+ "execution_count": null,
704
+ "metadata": {
705
+ "id": "qKnYkNiVp89p"
706
+ },
707
+ "outputs": [],
708
+ "source": [
709
+ "# Train it for 1,000,000 timesteps\n",
710
+ "model.learn(total_timesteps=1000000)\n",
711
+ "# Save the model\n",
712
+ "model_name = \"ppo-LunarLander-v2\"\n",
713
+ "model.save(model_name)\n"
714
+ ]
715
+ },
716
+ {
717
+ "cell_type": "markdown",
718
+ "metadata": {
719
+ "id": "1bQzQ-QcE3zo"
720
+ },
721
+ "source": [
722
+ "#### Solution"
723
+ ]
724
+ },
725
+ {
726
+ "cell_type": "code",
727
+ "execution_count": null,
728
+ "metadata": {
729
+ "id": "poBCy9u_csyR"
730
+ },
731
+ "outputs": [],
732
+ "source": [
733
+ "# SOLUTION\n",
734
+ "# Train it for 1,000,000 timesteps\n",
735
+ "model.learn(total_timesteps=1000000)\n",
736
+ "# Save the model\n",
737
+ "model_name = \"ppo-LunarLander-v2\"\n",
738
+ "model.save(model_name)"
739
+ ]
740
+ },
741
+ {
742
+ "cell_type": "markdown",
743
+ "metadata": {
744
+ "id": "BY_HuedOoISR"
745
+ },
746
+ "source": [
747
+ "## Evaluate the agent 📈\n",
748
+ "- Remember to wrap the environment in a [Monitor](https://stable-baselines3.readthedocs.io/en/master/common/monitor.html).\n",
749
+ "- Now that our Lunar Lander agent is trained 🚀, we need to **check its performance**.\n",
750
+ "- Stable-Baselines3 provides a method to do that: `evaluate_policy`.\n",
751
+ "- To fill that part you need to [check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading)\n",
752
+ "- In the next step, we'll see **how to automatically evaluate and share your agent to compete in a leaderboard, but for now let's do it ourselves**\n",
753
+ "\n",
754
+ "\n",
755
+ "💡 When you evaluate your agent, you should not use your training environment but create an evaluation environment."
756
+ ]
757
+ },
758
+ {
759
+ "cell_type": "code",
760
+ "execution_count": null,
761
+ "metadata": {
762
+ "id": "yRpno0glsADy"
763
+ },
764
+ "outputs": [],
765
+ "source": [
766
+ "eval_env = Monitor(gym.make(\"LunarLander-v2\", render_mode='rgb_array'))\n",
767
+ "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n",
768
+ "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")\n",
769
+ "\n"
770
+ ]
771
+ },
772
+ {
773
+ "cell_type": "markdown",
774
+ "metadata": {
775
+ "id": "BqPKw3jt_pG5"
776
+ },
777
+ "source": [
778
+ "#### Solution"
779
+ ]
780
+ },
781
+ {
782
+ "cell_type": "code",
783
+ "execution_count": null,
784
+ "metadata": {
785
+ "id": "zpz8kHlt_a_m"
786
+ },
787
+ "outputs": [],
788
+ "source": [
789
+ "#@title\n",
790
+ "eval_env = Monitor(gym.make(\"LunarLander-v2\", render_mode='rgb_array'))\n",
791
+ "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n",
792
+ "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")"
793
+ ]
794
+ },
795
+ {
796
+ "cell_type": "markdown",
797
+ "metadata": {
798
+ "id": "reBhoODwcXfr"
799
+ },
800
+ "source": [
801
+ "- In my case, I got a mean reward of `200.20 +/- 20.80` after training for 1 million steps, which means that our lunar lander agent is ready to land on the moon 🌛🥳."
802
+ ]
803
+ },
804
+ {
805
+ "cell_type": "markdown",
806
+ "metadata": {
807
+ "id": "IK_kR78NoNb2"
808
+ },
809
+ "source": [
810
+ "## Publish our trained model on the Hub 🔥\n",
811
+ "Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.\n",
812
+ "\n",
813
+ "📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n",
814
+ "\n",
815
+ "Here's an example of a Model Card (with Space Invaders):"
816
+ ]
817
+ },
818
+ {
819
+ "cell_type": "markdown",
820
+ "metadata": {
821
+ "id": "Gs-Ew7e1gXN3"
822
+ },
823
+ "source": [
824
+ "By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n",
825
+ "\n",
826
+ "This way:\n",
827
+ "- You can **showcase our work** 🔥\n",
828
+ "- You can **visualize your agent playing** 👀\n",
829
+ "- You can **share with the community an agent that others can use** 💾\n",
830
+ "- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
831
+ ]
832
+ },
833
+ {
834
+ "cell_type": "markdown",
835
+ "metadata": {
836
+ "id": "JquRrWytA6eo"
837
+ },
838
+ "source": [
839
+ "To be able to share your model with the community there are three more steps to follow:\n",
840
+ "\n",
841
+ "1️⃣ (If it's not already done) create an account on Hugging Face ➡ https://huggingface.co/join\n",
842
+ "\n",
843
+ "2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
844
+ "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
845
+ "\n",
846
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
847
+ "\n",
848
+ "- Copy the token\n",
849
+ "- Run the cell below and paste the token"
850
+ ]
851
+ },
852
+ {
853
+ "cell_type": "code",
854
+ "execution_count": null,
855
+ "metadata": {
856
+ "id": "GZiFBBlzxzxY"
857
+ },
858
+ "outputs": [],
859
+ "source": [
860
+ "notebook_login()\n",
861
+ "!git config --global credential.helper store"
862
+ ]
863
+ },
864
+ {
865
+ "cell_type": "markdown",
866
+ "metadata": {
867
+ "id": "_tsf2uv0g_4p"
868
+ },
869
+ "source": [
870
+ "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
871
+ ]
872
+ },
873
+ {
874
+ "cell_type": "markdown",
875
+ "metadata": {
876
+ "id": "FGNh9VsZok0i"
877
+ },
878
+ "source": [
879
+ "3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function"
880
+ ]
881
+ },
882
+ {
883
+ "cell_type": "markdown",
884
+ "metadata": {
885
+ "id": "Ay24l6bqFF18"
886
+ },
887
+ "source": [
888
+ "Let's fill the `package_to_hub` function:\n",
889
+ "- `model`: our trained model.\n",
890
+ "- `model_name`: the name of the trained model that we defined in `model_save`\n",
891
+ "- `model_architecture`: the model architecture we used, in our case PPO\n",
892
+ "- `env_id`: the name of the environment, in our case `LunarLander-v2`\n",
893
+ "- `eval_env`: the evaluation environment defined in eval_env\n",
894
+ "- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `(repo_id = {username}/{repo_name})`\n",
895
+ "\n",
896
+ "💡 **A good name is {username}/{model_architecture}-{env_id}**\n",
897
+ "\n",
898
+ "- `commit_message`: message of the commit"
899
+ ]
900
+ },
901
+ {
902
+ "cell_type": "code",
903
+ "execution_count": null,
904
+ "metadata": {
905
+ "id": "JPG7ofdGIHN8"
906
+ },
907
+ "outputs": [],
908
+ "source": [
909
+ "import gymnasium as gym\n",
910
+ "\n",
911
+ "from stable_baselines3 import PPO\n",
912
+ "from stable_baselines3.common.vec_env import DummyVecEnv\n",
913
+ "from stable_baselines3.common.env_util import make_vec_env\n",
914
+ "\n",
915
+ "from huggingface_sb3 import package_to_hub\n",
916
+ "\n",
917
+ "# PLACE the variables you've just defined two cells above\n",
918
+ "# Define the name of the environment\n",
919
+ "env_id = \"LunarLander-v2\"\n",
920
+ "\n",
921
+ "# TODO: Define the model architecture we used\n",
922
+ "model_architecture = \"PPO\"\n",
923
+ "\n",
924
+ "## Define a repo_id\n",
925
+ "## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n",
926
+ "## CHANGE WITH YOUR REPO ID\n",
927
+ "repo_id = \"Gyaneshere/ppo-LunarLander-v2\" # Change with your repo id, you can't push with mine 😄\n",
928
+ "\n",
929
+ "## Define the commit message\n",
930
+ "commit_message = \"Upload PPO LunarLander-v2 trained agent\"\n",
931
+ "\n",
932
+ "# Create the evaluation env and set the render_mode=\"rgb_array\"\n",
933
+ "eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode=\"rgb_array\")])\n",
934
+ "\n",
935
+ "# PLACE the package_to_hub function you've just filled here\n",
936
+ "package_to_hub(model=model, # Our trained model\n",
937
+ " model_name=model_name, # The name of our trained model\n",
938
+ " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n",
939
+ " env_id=env_id, # Name of the environment\n",
940
+ " eval_env=eval_env, # Evaluation Environment\n",
941
+ " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n",
942
+ " commit_message=commit_message)"
943
+ ]
944
+ },
945
+ {
946
+ "cell_type": "markdown",
947
+ "metadata": {
948
+ "id": "Avf6gufJBGMw"
949
+ },
950
+ "source": [
951
+ "#### Solution\n"
952
+ ]
953
+ },
954
+ {
955
+ "cell_type": "code",
956
+ "execution_count": null,
957
+ "metadata": {
958
+ "id": "I2E--IJu8JYq"
959
+ },
960
+ "outputs": [],
961
+ "source": [
962
+ "import gymnasium as gym\n",
963
+ "\n",
964
+ "from stable_baselines3 import PPO\n",
965
+ "from stable_baselines3.common.vec_env import DummyVecEnv\n",
966
+ "from stable_baselines3.common.env_util import make_vec_env\n",
967
+ "\n",
968
+ "from huggingface_sb3 import package_to_hub\n",
969
+ "\n",
970
+ "# PLACE the variables you've just defined two cells above\n",
971
+ "# Define the name of the environment\n",
972
+ "env_id = \"LunarLander-v2\"\n",
973
+ "\n",
974
+ "# TODO: Define the model architecture we used\n",
975
+ "model_architecture = \"PPO\"\n",
976
+ "\n",
977
+ "## Define a repo_id\n",
978
+ "## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n",
979
+ "## CHANGE WITH YOUR REPO ID\n",
980
+ "repo_id = \"ThomasSimonini/ppo-LunarLander-v2\" # Change with your repo id, you can't push with mine 😄\n",
981
+ "\n",
982
+ "## Define the commit message\n",
983
+ "commit_message = \"Upload PPO LunarLander-v2 trained agent\"\n",
984
+ "\n",
985
+ "# Create the evaluation env and set the render_mode=\"rgb_array\"\n",
986
+ "eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode=\"rgb_array\")])\n",
987
+ "\n",
988
+ "# PLACE the package_to_hub function you've just filled here\n",
989
+ "package_to_hub(model=model, # Our trained model\n",
990
+ " model_name=model_name, # The name of our trained model\n",
991
+ " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n",
992
+ " env_id=env_id, # Name of the environment\n",
993
+ " eval_env=eval_env, # Evaluation Environment\n",
994
+ " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n",
995
+ " commit_message=commit_message)\n"
996
+ ]
997
+ },
998
+ {
999
+ "cell_type": "markdown",
1000
+ "metadata": {
1001
+ "id": "T79AEAWEFIxz"
1002
+ },
1003
+ "source": [
1004
+ "Congrats 🥳 you've just trained and uploaded your first Deep Reinforcement Learning agent. The script above should have displayed a link to a model repository such as https://huggingface.co/osanseviero/test_sb3. When you go to this link, you can:\n",
1005
+ "* See a video preview of your agent at the right.\n",
1006
+ "* Click \"Files and versions\" to see all the files in the repository.\n",
1007
+ "* Click \"Use in stable-baselines3\" to get a code snippet that shows how to load the model.\n",
1008
+ "* A model card (`README.md` file) which gives a description of the model\n",
1009
+ "\n",
1010
+ "Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.\n",
1011
+ "\n",
1012
+ "Compare the results of your LunarLander-v2 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard"
1013
+ ]
1014
+ },
1015
+ {
1016
+ "cell_type": "markdown",
1017
+ "metadata": {
1018
+ "id": "9nWnuQHRfFRa"
1019
+ },
1020
+ "source": [
1021
+ "## Load a saved LunarLander model from the Hub 🤗\n",
1022
+ "Thanks to [ironbar](https://github.com/ironbar) for the contribution.\n",
1023
+ "\n",
1024
+ "Loading a saved model from the Hub is really easy.\n",
1025
+ "\n",
1026
+ "You go to https://huggingface.co/models?library=stable-baselines3 to see the list of all the Stable-baselines3 saved models.\n",
1027
+ "1. You select one and copy its repo_id\n",
1028
+ "\n",
1029
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit1/copy-id.png\" alt=\"Copy-id\"/>"
1030
+ ]
1031
+ },
1032
+ {
1033
+ "cell_type": "markdown",
1034
+ "metadata": {
1035
+ "id": "hNPLJF2bfiUw"
1036
+ },
1037
+ "source": [
1038
+ "2. Then we just need to use load_from_hub with:\n",
1039
+ "- The repo_id\n",
1040
+ "- The filename: the saved model inside the repo and its extension (*.zip)"
1041
+ ]
1042
+ },
1043
+ {
1044
+ "cell_type": "markdown",
1045
+ "metadata": {
1046
+ "id": "bhb9-NtsinKB"
1047
+ },
1048
+ "source": [
1049
+ "Because the model I download from the Hub was trained with Gym (the former version of Gymnasium) we need to install shimmy a API conversion tool that will help us to run the environment correctly.\n",
1050
+ "\n",
1051
+ "Shimmy Documentation: https://github.com/Farama-Foundation/Shimmy"
1052
+ ]
1053
+ },
1054
+ {
1055
+ "cell_type": "code",
1056
+ "execution_count": null,
1057
+ "metadata": {
1058
+ "id": "03WI-bkci1kH"
1059
+ },
1060
+ "outputs": [],
1061
+ "source": [
1062
+ "!pip install gymnasium==0.29\n",
1063
+ "!pip install shimmy==1.3.0"
1064
+ ]
1065
+ },
1066
+ {
1067
+ "cell_type": "code",
1068
+ "execution_count": null,
1069
+ "metadata": {
1070
+ "id": "oj8PSGHJfwz3"
1071
+ },
1072
+ "outputs": [],
1073
+ "source": [
1074
+ "from huggingface_sb3 import load_from_hub\n",
1075
+ "from stable_baselines3 import PPO\n",
1076
+ "\n",
1077
+ "repo_id = \"Gyaneshere/ppo-LunarLander-v2\" # The repo_id\n",
1078
+ "filename = \"ppo-LunarLander-v2.zip\" # The model filename.zip\n",
1079
+ "\n",
1080
+ "# When the model was trained on Python 3.8 the pickle protocol is 5\n",
1081
+ "# But Python 3.6, 3.7 use protocol 4\n",
1082
+ "# In order to get compatibility we need to:\n",
1083
+ "# 1. Install pickle5 (we done it at the beginning of the colab)\n",
1084
+ "# 2. Create a custom empty object we pass as parameter to PPO.load()\n",
1085
+ "custom_objects = {\n",
1086
+ " \"learning_rate\": 0.0,\n",
1087
+ " \"lr_schedule\": lambda _: 0.0,\n",
1088
+ " \"clip_range\": lambda _: 0.0,\n",
1089
+ "}\n",
1090
+ "\n",
1091
+ "checkpoint = load_from_hub(repo_id, filename)\n",
1092
+ "model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)"
1093
+ ]
1094
+ },
1095
+ {
1096
+ "cell_type": "markdown",
1097
+ "metadata": {
1098
+ "id": "Fs0Y-qgPgLUf"
1099
+ },
1100
+ "source": [
1101
+ "Let's evaluate this agent:"
1102
+ ]
1103
+ },
1104
+ {
1105
+ "cell_type": "code",
1106
+ "execution_count": null,
1107
+ "metadata": {
1108
+ "id": "PAEVwK-aahfx"
1109
+ },
1110
+ "outputs": [],
1111
+ "source": [
1112
+ "from stable_baselines3.common.monitor import Monitor\n",
1113
+ "import gymnasium as gym\n",
1114
+ "from stable_baselines3.common.evaluation import evaluate_policy\n",
1115
+ "\n",
1116
+ "#@title\n",
1117
+ "eval_env = Monitor(gym.make(\"LunarLander-v2\"))\n",
1118
+ "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n",
1119
+ "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")"
1120
+ ]
1121
+ },
1122
+ {
1123
+ "cell_type": "markdown",
1124
+ "metadata": {
1125
+ "id": "BQAwLnYFPk-s"
1126
+ },
1127
+ "source": [
1128
+ "## Some additional challenges 🏆\n",
1129
+ "The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!\n",
1130
+ "\n",
1131
+ "In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
1132
+ "\n",
1133
+ "Here are some ideas to achieve so:\n",
1134
+ "* Train more steps\n",
1135
+ "* Try different hyperparameters for `PPO`. You can see them at https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters.\n",
1136
+ "* Check the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) and try another model such as DQN.\n",
1137
+ "* **Push your new trained model** on the Hub 🔥\n",
1138
+ "\n",
1139
+ "**Compare the results of your LunarLander-v2 with your classmates** using the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) 🏆\n",
1140
+ "\n",
1141
+ "Is moon landing too boring for you? Try to **change the environment**, why not use MountainCar-v0, CartPole-v1 or CarRacing-v0? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉."
1142
+ ]
1143
+ },
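+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you want to try the DQN suggestion above, here is a minimal sketch (our own addition) assuming the same imports as this notebook; the timestep count is illustrative and untuned, so check the SB3 DQN documentation for better hyperparameters.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: swapping PPO for DQN (illustrative, untuned hyperparameters).\n",
+ "from stable_baselines3 import DQN\n",
+ "\n",
+ "dqn_env = gym.make(\"LunarLander-v2\")\n",
+ "dqn_model = DQN(\"MlpPolicy\", dqn_env, verbose=1)\n",
+ "dqn_model.learn(total_timesteps=100_000) # try more steps for better results\n",
+ "dqn_model.save(\"dqn-LunarLander-v2\")"
+ ]
+ },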
1144
+ {
1145
+ "cell_type": "markdown",
1146
+ "metadata": {
1147
+ "id": "9lM95-dvmif8"
1148
+ },
1149
+ "source": [
1150
+ "________________________________________________________________________\n",
1151
+ "Congrats on finishing this chapter! That was the biggest one, **and there was a lot of information.**\n",
1152
+ "\n",
1153
+ "If you’re still feel confused with all these elements...it's totally normal! **This was the same for me and for all people who studied RL.**\n",
1154
+ "\n",
1155
+ "Take time to really **grasp the material before continuing and try the additional challenges**. It’s important to master these elements and have a solid foundations.\n",
1156
+ "\n",
1157
+ "Naturally, during the course, we’re going to dive deeper into these concepts but **it’s better to have a good understanding of them now before diving into the next chapters.**\n",
1158
+ "\n"
1159
+ ]
1160
+ },
1161
+ {
1162
+ "cell_type": "markdown",
1163
+ "metadata": {
1164
+ "id": "BjLhT70TEZIn"
1165
+ },
1166
+ "source": [
1167
+ "Next time, in the bonus unit 1, you'll train Huggy the Dog to fetch the stick.\n",
1168
+ "\n",
1169
+ "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit1/huggy.jpg\" alt=\"Huggy\"/>\n",
1170
+ "\n",
1171
+ "## Keep learning, stay awesome 🤗"
1172
+ ]
1173
+ }
1174
+ ],
1175
+ "metadata": {
1176
+ "accelerator": "GPU",
1177
+ "colab": {
1178
+ "collapsed_sections": [
1179
+ "QAN7B0_HCVZC",
1180
+ "BqPKw3jt_pG5"
1181
+ ],
1182
+ "private_outputs": true,
1183
+ "provenance": [],
1184
+ "gpuType": "T4"
1185
+ },
1186
+ "kernelspec": {
1187
+ "display_name": "Python 3",
1188
+ "name": "python3"
1189
+ },
1190
+ "language_info": {
1191
+ "name": "python",
1192
+ "version": "3.9.7"
1193
+ },
1194
+ "vscode": {
1195
+ "interpreter": {
1196
+ "hash": "ed7f8024e43d3b8f5ca3c5e1a8151ab4d136b3ecee1e3fd59e0766ccc55e1b10"
1197
+ }
1198
+ }
1199
+ },
1200
+ "nbformat": 4,
1201
+ "nbformat_minor": 0
1202
+ }