Spaces: molssiai-hub/pytorch-training-loop

smostafanejad committed on Jul 25, 2024

Commit 1f0bc98 (verified) · 1 Parent(s): db3d77d

Upload pytorch_training_loop.ipynb

Files changed (1)

pytorch_training_loop.ipynb +307 -0

	@@ -0,0 +1,307 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Building a PyTorch Training Loop"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To access the data on the Hugging Face Hub and build the data loaders\n",
+    "for our training loop, we first need to import the necessary\n",
+    "libraries."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset           # Loading datasets from Hugging Face Hub\n",
+    "import torch                                # PyTorch\n",
+    "from torch.utils.data import DataLoader     # PyTorch DataLoader for creating batches\n",
+    "from pprint import pprint                   # Pretty print\n",
+    "from tqdm import tqdm                       # Progress bar"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this tutorial, we are going to work with the\n",
+    "[PubChemQC-B3LYP/6-31G*//PM6\n",
+    "Dataset](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp)\n",
+    "(PubChemQC-B3LYP for short) from the [PubChemQC dataset\n",
+    "collection](https://huggingface.co/collections/molssiai-hub/pubchemqc-datasets-669e5482260861ba7cce3d1c).\n",
+    "We load it directly from the Hugging Face Hub below."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After importing the modules, we set a few variables that will be used throughout\n",
+    "this demo and pass them to the `load_dataset` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# path to the dataset repository on the Hugging Face Hub\n",
+    "path = \"molssiai-hub/pubchemqc-b3lyp\"\n",
+    "\n",
+    "# set the dataset configuration/subset name\n",
+    "name = \"b3lyp_pm6\"\n",
+    "\n",
+    "# set the dataset split\n",
+    "split = \"train\"\n",
+    "\n",
+    "# load the dataset\n",
+    "hub_dataset = load_dataset(path=path,\n",
+    "                           name=name,\n",
+    "                           split=split,\n",
+    "                           streaming=True,\n",
+    "                           trust_remote_code=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, we set the `streaming` parameter to `True` so that the data is streamed\n",
+    "from the Hub instead of being downloaded to disk. In this mode, the\n",
+    "`load_dataset` function returns an `IterableDataset` object that yields the\n",
+    "examples one at a time as we iterate over it. The `trust_remote_code` argument is also\n",
+    "set to `True` to allow the usage of a custom [load\n",
+    "script](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp/blob/main/pubchemqc-b3lyp.py)\n",
+    "for the data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By default, iterating over a Hugging Face dataset yields native Python\n",
+    "objects (e.g., a dictionary per example). However, we can use the `with_format()`\n",
+    "method to specify the format in which the data is returned. In our case, we want\n",
+    "PyTorch tensors (`torch.Tensor`) so that we can build the data loaders for our training\n",
+    "loop. Let us transform our data and check the result."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set the dataset format to PyTorch tensors\n",
+    "hub_dataset = hub_dataset.with_format(\"torch\")\n",
+    "\n",
+    "# fetch the first data point\n",
+    "next(iter(hub_dataset.take(1)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can see that the numerical features in our data sample have been\n",
+    "converted to `torch.Tensor` objects. Let us access the `coordinates` field\n",
+    "to make this clearer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# fetch the first data point\n",
+    "data_point = next(iter(hub_dataset.take(1)))\n",
+    "\n",
+    "# print the coordinates of the first data point and its type\n",
+    "data_point[\"coordinates\"], type(data_point[\"coordinates\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the code snippet above, we have passed `hub_dataset.take(1)`, which is itself an\n",
+    "`IterableDataset`, to the `iter()` function to create an iterator and used the `next()`\n",
+    "function to advance it once and access the first sample."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Our PubChemQC-B3LYP `IterableDataset` object is divided into multiple shards\n",
+    "to enable multiprocessing and to help shuffle the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(f\"the PubChemQC-B3LYP dataset has {hub_dataset.n_shards} shards\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If we shuffle our data, the order of the shards will also be shuffled. This is\n",
+    "important to consider when building the PyTorch data loaders for our training\n",
+    "loop."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# shuffle the dataset\n",
+    "hub_dataset = hub_dataset.shuffle(seed=123, buffer_size=1000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `buffer_size` argument controls the size of the buffer from which examples\n",
+    "are randomly sampled. For instance, when we call `IterableDataset.shuffle()`\n",
+    "with `buffer_size=1000`, the buffer is filled with the first thousand examples; each time\n",
+    "an example is randomly drawn from the buffer, it is replaced with the next example\n",
+    "from the dataset. The `buffer_size` argument is set to 1000 by default."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A nice feature of the Hugging Face dataset objects is that they can be directly\n",
+    "passed to PyTorch DataLoaders, as shown below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create a PyTorch DataLoader with a batch size of 4\n",
+    "dataloader = DataLoader(hub_dataset, batch_size=4, collate_fn=lambda x: x)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By default, the `DataLoader` object uses a default collate function which\n",
+    "stacks the samples of a batch into `torch.Tensor` objects. For our\n",
+    "dataset examples, however, we cannot use the default collate function because\n",
+    "our data samples are not of the same length (different molecules may have\n",
+    "different numbers of atoms and coordinates). To circumvent this problem, we\n",
+    "pass a lambda function as `collate_fn` that returns each batch unchanged, that is,\n",
+    "as a list of data points (dictionaries) without any transformation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Similar to the `hub_dataset`, we can also wrap the `dataloader` object inside an\n",
+    "iterator and use the `next()` function to access the first batch of data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "batch = next(iter(dataloader))      # a list of 4 data points (dictionaries)\n",
+    "batch[0][\"coordinates\"]             # coordinates of the first data point in the batch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Building a Training Loop in PyTorch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that we know how to access, fetch and shuffle batches of data samples in our\n",
+    "PyTorch data loader, we can build the skeleton of a simple training loop."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set up the training loop\n",
+    "for epoch in range(1, 4, 1):\n",
+    "\n",
+    "    # set the epoch\n",
+    "    hub_dataset.set_epoch(epoch)\n",
+    "\n",
+    "    # iterate over the batches in the DataLoader\n",
+    "    for i, batch in enumerate(tqdm(dataloader, total=4, desc=f\"Epoch {epoch}\")):\n",
+    "        if i == 4:\n",
+    "            pprint(f\"The isomeric SMILES from the first data point of batch {i+1}: {batch[0]['pubchem-isomeric-smiles']}\",\n",
+    "                   width=100,\n",
+    "                   compact=True)\n",
+    "            break\n",
+    "        print(f\"Epoch: {epoch}, Batch: {i+1}, Batch size: {len(batch)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the code snippet above, we have called `set_epoch(epoch)` at the start of each\n",
+    "epoch. For a shuffled `IterableDataset`, this adds the epoch number to the random seed\n",
+    "so that the data is reshuffled differently in every epoch."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "hugface",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
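
The notebook above describes `buffer_size` as the size of a buffer from which examples are randomly drawn and then replaced. The sketch below illustrates that general buffered-shuffle idea in plain Python. It is not the `datasets` implementation: the function name `buffered_shuffle` is hypothetical, and the drain step that empties the buffer at the end of the stream is an assumption made for this illustration.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream with a fixed-size buffer (illustrative sketch)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # fill the buffer before emitting anything
            buffer.append(item)
            continue
        # pick a random slot, emit its content, and refill the slot with the new item
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # the stream is exhausted: emit the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


# a stream of 10 integers stands in for the streamed dataset examples
shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=123))
print(shuffled)
```

Because only `buffer_size` examples are held in memory at a time, the first emitted example always comes from the first `buffer_size` items of the stream. A small buffer therefore gives a weaker shuffle than a large one, which is one reason shuffling the shard order as well is useful.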