{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4b572b87",
   "metadata": {},
   "source": [
    "# Working with Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a11900d1",
   "metadata": {},
   "source": [
    "This demo focuses on getting to know the data we are going to work with before\n",
    "downloading it and start processing it. Here, we are going to use the\n",
    "[PubChemQC-B3LYP/6-31G*//PM6\n",
    "Dataset](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp) (PubChemQC-B3LYP for short)\n",
    "from the [MolSSI AI Hub](https://huggingface.co/molssiai-hub).\n",
    "\n",
    "\n",
    "In order to be able to interact with the data, we need to import the necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2493a0c3-4b27-496a-9514-32fb4941c94e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import datasets                         # Hugging Face datasets library\n",
    "from datasets import (\n",
    "    get_dataset_config_info,            # Get information about a dataset configurations/subsets\n",
    "    get_dataset_config_names,           # Get the list of names of all dataset configurations/subsets\n",
    "    get_dataset_split_names,            # Get the list of names of all dataset splits\n",
    "    get_dataset_default_config_name     # Get the default configuration name of a dataset\n",
    ")\n",
    "from pprint import pprint               # Pretty print"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b362b39e",
   "metadata": {},
   "source": [
    "After importing the modules, we set a few variables that will be used throughout\n",
    "this demo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4892171-f79f-4eee-99db-a21d11b09e5c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# path to the dataset repository on the Hugging Face Hub\n",
    "path = \"molssiai-hub/pubchemqc-b3lyp\"\n",
    "\n",
    "# set the dataset configuration/subset name\n",
    "name = \"b3lyp_pm6\"\n",
    "\n",
    "# set the dataset split\n",
    "split = \"train\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "affb8bf2",
   "metadata": {},
   "source": [
    "The modules and functions we imported allow us to inspect our dataset for a wide\n",
    "range of metadata and information without actually downloading it on disk. For\n",
    "example, we can access the list of all available configurations/subsets, splits,\n",
    "and the default configuration name in our dataset.\n",
    "\n",
    "The `get_dataset_config_info` function returns a `datasets.info.DatasetInfo` \n",
    "object which contains the metadata of our dataset configuration all in one place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0b1b1cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the information about the PubChemQC-B3LYP dataset configuration/subset\n",
    "config_info = get_dataset_config_info(path, name, trust_remote_code=True)\n",
    "\n",
    "# print the retrieved information about the PubChemQC-B3LYP dataset\n",
    "print(\"Information about the PubChemQC-B3LYP dataset configuration/subset:\")\n",
    "pprint(config_info,\n",
    "       indent=4,\n",
    "       width=100,\n",
    "       compact=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c01684ca",
   "metadata": {},
   "source": [
    "Processing a lengthy output is not always convenient. In order to make the\n",
    "metadata inspection easier, we can access specific attributes of the\n",
    "`datasets.info.DatasetInfo` instance directly. For example, the `description`\n",
    "attribute can provide access to the content of the *description* field in the\n",
    "dataset configuration as shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7ca2dff",
   "metadata": {},
   "outputs": [],
   "source": [
    "pprint(config_info.description,\n",
    "       indent=4,\n",
    "       width=100,\n",
    "       compact=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7564c7f6",
   "metadata": {},
   "source": [
    "We can use other imported `get_dataset_*` helper functions to directly access\n",
    "the metadata and circumvent the creation of a `datasets.info.DatasetInfo` object\n",
    "as shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f50da911",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the list of all available dataset configurations/subsets in the PubChemQC-B3LYP dataset\n",
    "config_names = get_dataset_config_names(path)\n",
    "\n",
    "# print the retrieved information about the PubChemQC-B3LYP dataset\n",
    "print(\"List of available dataset configurations/subsets in the PubChemQC-B3LYP dataset:\")\n",
    "config_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a925405",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the list of all available dataset splits in the PubChemQC-B3LYP dataset\n",
    "split_names = get_dataset_split_names(path, name)\n",
    "\n",
    "# print the retrieved information about the PubChemQC-B3LYP dataset\n",
    "print(f\"List of available dataset splits in the PubChemQC-B3LYP dataset:\")\n",
    "split_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2534c098",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the default configuration name of the PubChemQC-B3LYP dataset\n",
    "default_config_name = get_dataset_default_config_name(path)\n",
    "\n",
    "# print the retrieved information about the PubChemQC-B3LYP dataset\n",
    "print(f\"Default configuration name of the PubChemQC-B3LYP dataset: {default_config_name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e477ae9",
   "metadata": {},
   "source": [
    "We can also list the available features in our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edefb732",
   "metadata": {},
   "outputs": [],
   "source": [
    "list(config_info.features)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6ad4a42",
   "metadata": {},
   "source": [
    "The `list()` function can be removed from the aforementioned command in order to\n",
    "create a dictionary of features alongside their corresponding data types.\n",
    "\n",
    "We can also access the citation information using the `citation` attribute\n",
    "as shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2749b3e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "pprint(config_info.citation,\n",
    "       indent=4,\n",
    "       width=100,\n",
    "       compact=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}