smostafanejad committed on
Commit
1f0bc98
·
verified ·
1 Parent(s): db3d77d

Upload pytorch_training_loop.ipynb

Files changed (1)
  1. pytorch_training_loop.ipynb +307 -0
pytorch_training_loop.ipynb ADDED
@@ -0,0 +1,307 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Building a PyTorch Training Loop"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "markdown",
12
+ "metadata": {},
13
+ "source": [
14
+ "To access the data on the Hugging Face Hub and build the\n",
15
+ "data loaders for our training loop, we first import the necessary\n",
16
+ "libraries."
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "code",
21
+ "execution_count": null,
22
+ "metadata": {},
23
+ "outputs": [],
24
+ "source": [
25
+ "from datasets import load_dataset # Loading datasets from Hugging Face Hub\n",
26
+ "import torch # PyTorch\n",
27
+ "from torch.utils.data import DataLoader # PyTorch DataLoader for creating batches\n",
28
+ "from pprint import pprint # Pretty print\n",
29
+ "from tqdm import tqdm # Progress bar"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "markdown",
34
+ "metadata": {},
35
+ "source": [
36
+ "In this tutorial, we are going to work with the\n",
37
+ "[PubChemQC-B3LYP/6-31G*//PM6\n",
38
+ "Dataset](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp)\n",
39
+ "(PubChemQC-B3LYP for short) from the [PubChemQC dataset\n",
40
+ "collection](https://huggingface.co/collections/molssiai-hub/pubchemqc-datasets-669e5482260861ba7cce3d1c).\n",
41
+ "Let us load this dataset as shown below."
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "markdown",
46
+ "metadata": {},
47
+ "source": [
48
+ "After importing the modules, we set a few variables that will be used throughout\n",
49
+ "this demo."
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {},
56
+ "outputs": [],
57
+ "source": [
58
+ "# path to the dataset repository on the Hugging Face Hub\n",
59
+ "path = \"molssiai-hub/pubchemqc-b3lyp\"\n",
60
+ "\n",
61
+ "# set the dataset configuration/subset name\n",
62
+ "name = \"b3lyp_pm6\"\n",
63
+ "\n",
64
+ "# set the dataset split\n",
65
+ "split = \"train\"\n",
66
+ "\n",
67
+ "# load the dataset\n",
68
+ "hub_dataset = load_dataset(path=path,\n",
69
+ "                           name=name,\n",
70
+ "                           split=split,\n",
71
+ "                           streaming=True,\n",
72
+ "                           trust_remote_code=True)"
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "markdown",
77
+ "metadata": {},
78
+ "source": [
79
+ "Here, we set the `streaming` parameter to `True` so that the data is streamed\n",
80
+ "from the Hub instead of being downloaded to disk. In this mode, the\n",
81
+ "`load_dataset` function returns an `IterableDataset` object, which yields\n",
82
+ "examples one at a time as we iterate over it. The `trust_remote_code` argument is also\n",
83
+ "set to `True` to allow the usage of a custom [load\n",
84
+ "script](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp/blob/main/pubchemqc-b3lyp.py)\n",
85
+ "for the data."
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "markdown",
90
+ "metadata": {},
91
+ "source": [
92
+ "By default, Hugging Face dataset objects return each example as a native\n",
93
+ "Python object (e.g., a dictionary). However, we can use the `with_format()`\n",
94
+ "method to specify the format in which the data is returned. In our case, we want\n",
95
+ "to use the `torch` format, which returns PyTorch tensors, to build the data loaders for our training\n",
96
+ "loop. Let us set the format and check the result."
97
+ ]
98
+ },
99
+ {
100
+ "cell_type": "code",
101
+ "execution_count": null,
102
+ "metadata": {},
103
+ "outputs": [],
104
+ "source": [
105
+ "# set the dataset format to PyTorch tensors\n",
106
+ "hub_dataset = hub_dataset.with_format(\"torch\")\n",
107
+ "\n",
108
+ "# fetch the first data point\n",
109
+ "next(iter(hub_dataset.take(1)))"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "markdown",
114
+ "metadata": {},
115
+ "source": [
116
+ "We can see that the numerical features in our data sample have been\n",
117
+ "converted to `torch.Tensor` objects. Let us access the `coordinates` field\n",
118
+ "to make this clearer."
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "code",
123
+ "execution_count": null,
124
+ "metadata": {},
125
+ "outputs": [],
126
+ "source": [
127
+ "# fetch the first data point\n",
128
+ "data_point = next(iter(hub_dataset.take(1)))\n",
129
+ "\n",
130
+ "# print the coordinates of the first data point and its type\n",
131
+ "data_point[\"coordinates\"], type(data_point[\"coordinates\"])"
132
+ ]
133
+ },
134
+ {
135
+ "cell_type": "markdown",
136
+ "metadata": {},
137
+ "source": [
138
+ "In the code snippet above, we called `take(1)` on the `IterableDataset` object, `hub_dataset`,\n",
139
+ "passed the result to the built-in `iter()` function to create an iterator, and used the `next()` function\n",
140
+ "to fetch the first sample from it."
141
+ ]
142
+ },
143
+ {
144
+ "cell_type": "markdown",
145
+ "metadata": {},
146
+ "source": [
147
+ "Our PubChemQC-B3LYP `IterableDataset` object is divided into multiple shards\n",
148
+ "to enable multiprocessing and to help shuffle the data."
149
+ ]
150
+ },
151
+ {
152
+ "cell_type": "code",
153
+ "execution_count": null,
154
+ "metadata": {},
155
+ "outputs": [],
156
+ "source": [
157
+ "print(f\"the PubChemQC-B3LYP dataset has {hub_dataset.n_shards} shards\")"
158
+ ]
159
+ },
160
+ {
161
+ "cell_type": "markdown",
162
+ "metadata": {},
163
+ "source": [
164
+ "When we shuffle the dataset, the order of the shards is shuffled as well. This is\n",
165
+ "important to consider when building the PyTorch data loaders for our training\n",
166
+ "loop."
167
+ ]
168
+ },
169
+ {
170
+ "cell_type": "code",
171
+ "execution_count": null,
172
+ "metadata": {},
173
+ "outputs": [],
174
+ "source": [
175
+ "# shuffle the dataset\n",
176
+ "hub_dataset = hub_dataset.shuffle(seed=123, buffer_size=1000)"
177
+ ]
178
+ },
179
+ {
180
+ "cell_type": "markdown",
181
+ "metadata": {},
182
+ "source": [
183
+ "The `buffer_size` argument controls the size of the buffer from which examples are\n",
184
+ "randomly sampled. For instance, when we iterate over the dataset returned by\n",
185
+ "`IterableDataset.shuffle()`, the buffer is first filled with the first thousand examples. Each time an\n",
186
+ "example is randomly drawn from the buffer, it is replaced with the next example from the\n",
187
+ "dataset. The `buffer_size` argument is set to 1000 by default."
188
+ ]
189
+ },
190
+ {
191
+ "cell_type": "markdown",
192
+ "metadata": {},
193
+ "source": [
194
+ "A nice feature of the Hugging Face dataset objects is that they can be directly\n",
195
+ "passed to PyTorch DataLoaders as shown below."
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "code",
200
+ "execution_count": null,
201
+ "metadata": {},
202
+ "outputs": [],
203
+ "source": [
204
+ "# create a PyTorch DataLoader with a batch size of 4\n",
205
+ "dataloader = DataLoader(hub_dataset, batch_size=4, collate_fn=lambda x: x)"
206
+ ]
207
+ },
208
+ {
209
+ "cell_type": "markdown",
210
+ "metadata": {},
211
+ "source": [
212
+ "By default, the `DataLoader` object uses a default collate function, which\n",
213
+ "stacks the samples of a batch into `torch.Tensor` objects. For our\n",
214
+ "dataset examples, however, we cannot use the default collate function because\n",
215
+ "our data samples are not of the same length (different molecules may have\n",
216
+ "different numbers of atoms and coordinates). To circumvent this problem, we\n",
217
+ "pass a lambda function as `collate_fn` that returns each batch as a list of data points (dictionaries)\n",
218
+ "without any transformation."
219
+ ]
220
+ },
221
+ {
222
+ "cell_type": "markdown",
223
+ "metadata": {},
224
+ "source": [
225
+ "Similar to the `hub_dataset`, we can also wrap the `dataloader` object inside an\n",
226
+ "iterator and use the `next()` function to access the first batch of data."
227
+ ]
228
+ },
229
+ {
230
+ "cell_type": "code",
231
+ "execution_count": null,
232
+ "metadata": {},
233
+ "outputs": [],
234
+ "source": [
235
+ "batch = next(iter(dataloader))\n",
236
+ "batch[0][\"coordinates\"]"
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "markdown",
241
+ "metadata": {},
242
+ "source": [
243
+ "## Building a Training Loop in PyTorch"
244
+ ]
245
+ },
246
+ {
247
+ "cell_type": "markdown",
248
+ "metadata": {},
249
+ "source": [
250
+ "Now that we know how to access, fetch and shuffle batches of data samples in our\n",
251
+ "PyTorch data loader, we can build the skeleton of a simple training loop."
252
+ ]
253
+ },
254
+ {
255
+ "cell_type": "code",
256
+ "execution_count": null,
257
+ "metadata": {},
258
+ "outputs": [],
259
+ "source": [
260
+ "# set up the training loop\n",
261
+ "for epoch in range(1, 4, 1):\n",
262
+ "\n",
263
+ "    # set the epoch\n",
264
+ "    hub_dataset.set_epoch(epoch)\n",
265
+ "\n",
266
+ "    # iterate over the batches in the DataLoader\n",
267
+ "    for i, batch in enumerate(tqdm(dataloader, total=4, desc=f\"Epoch {epoch}\")):\n",
268
+ "        if i == 4:\n",
269
+ "            pprint(f\"The isomeric SMILES from the first data point of batch {i + 1}: {batch[0]['pubchem-isomeric-smiles']}\",\n",
270
+ "                   width=100,\n",
271
+ "                   compact=True)\n",
272
+ "            break\n",
273
+ "        print(f\"Epoch: {epoch}, Batch: {i+1}, Batch size: {len(batch)}\")"
274
+ ]
275
+ },
276
+ {
277
+ "cell_type": "markdown",
278
+ "metadata": {},
279
+ "source": [
280
+ "In the code snippet above, we have used the `set_epoch(epoch)` method, which\n",
281
+ "is typically called at the beginning of each epoch, including in distributed settings, so that the\n",
282
+ "epoch number is combined with the random seed and the data is reshuffled differently in each epoch."
283
+ ]
284
+ }
285
+ ],
286
+ "metadata": {
287
+ "kernelspec": {
288
+ "display_name": "hugface",
289
+ "language": "python",
290
+ "name": "python3"
291
+ },
292
+ "language_info": {
293
+ "codemirror_mode": {
294
+ "name": "ipython",
295
+ "version": 3
296
+ },
297
+ "file_extension": ".py",
298
+ "mimetype": "text/x-python",
299
+ "name": "python",
300
+ "nbconvert_exporter": "python",
301
+ "pygments_lexer": "ipython3",
302
+ "version": "3.10.13"
303
+ }
304
+ },
305
+ "nbformat": 4,
306
+ "nbformat_minor": 2
307
+ }
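
The buffered shuffling and per-epoch reseeding described in the notebook can be illustrated with a small, library-free sketch. This is not the `datasets` implementation: the names `buffered_shuffle` and `shuffled_epoch` are hypothetical, and deriving the per-epoch seed as `seed + epoch` is an assumption made here only to mimic the effect of `set_epoch(epoch)`.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (illustrative only)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # fill the buffer with the first `buffer_size` items
            buffer.append(item)
            continue
        # the buffer is full: emit a random buffered item and
        # put the incoming item in the freed slot
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # the stream is exhausted: emit the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


def shuffled_epoch(stream: Iterable[T], buffer_size: int, seed: int, epoch: int) -> List[T]:
    # hypothetical helper: assume the per-epoch seed is `seed + epoch`
    return list(buffered_shuffle(stream, buffer_size, seed + epoch))


# the same seed and epoch reproduce the same order
epoch_1 = shuffled_epoch(range(20), buffer_size=5, seed=123, epoch=1)
epoch_1_again = shuffled_epoch(range(20), buffer_size=5, seed=123, epoch=1)
# a different epoch changes the seed used for shuffling
epoch_2 = shuffled_epoch(range(20), buffer_size=5, seed=123, epoch=2)
```

Because an item can only be emitted once it has entered the buffer, a small `buffer_size` gives a local shuffle only. In this sketch, the value at output position `k` always comes from the first `k + buffer_size` items of the stream, which is why shuffling the shard order as well helps mix the data more thoroughly.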