{ "cells": [ { "cell_type": "markdown", "id": "4b572b87", "metadata": {}, "source": [ "# Working with Metadata" ] }, { "cell_type": "markdown", "id": "a11900d1", "metadata": {}, "source": [ "This demo focuses on getting to know the data we are going to work with before\n", "downloading and processing it. Here, we are going to use the\n", "[PubChemQC-B3LYP/6-31G*//PM6\n", "Dataset](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp) (PubChemQC-B3LYP for short)\n", "from the [MolSSI AI Hub](https://huggingface.co/molssiai-hub).\n", "\n", "\n", "To interact with the data, we first need to import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "id": "2493a0c3-4b27-496a-9514-32fb4941c94e", "metadata": { "tags": [] }, "outputs": [], "source": [ "import datasets # Hugging Face datasets library\n", "from datasets import (\n", " get_dataset_config_info, # Get information about a dataset configuration/subset\n", " get_dataset_config_names, # Get the list of names of all dataset configurations/subsets\n", " get_dataset_split_names, # Get the list of names of all dataset splits\n", " get_dataset_default_config_name # Get the default configuration name of a dataset\n", ")\n", "from pprint import pprint # Pretty print" ] }, { "cell_type": "markdown", "id": "b362b39e", "metadata": {}, "source": [ "After importing the modules, we set a few variables that will be used throughout\n", "this demo." 
] }, { "cell_type": "code", "execution_count": null, "id": "c4892171-f79f-4eee-99db-a21d11b09e5c", "metadata": { "tags": [] }, "outputs": [], "source": [ "# path to the dataset repository on the Hugging Face Hub\n", "path = \"molssiai-hub/pubchemqc-b3lyp\"\n", "\n", "# set the dataset configuration/subset name\n", "name = \"b3lyp_pm6\"\n", "\n", "# set the dataset split\n", "split = \"train\"" ] }, { "cell_type": "markdown", "id": "affb8bf2", "metadata": {}, "source": [ "The modules and functions we imported allow us to inspect a wide range of\n", "metadata about our dataset without actually downloading it to disk. For\n", "example, we can access the list of all available configurations/subsets, splits,\n", "and the default configuration name in our dataset.\n", "\n", "The `get_dataset_config_info` function returns a `datasets.info.DatasetInfo`\n", "object which contains the metadata of our dataset configuration all in one place." ] }, { "cell_type": "code", "execution_count": null, "id": "b0b1b1cf", "metadata": {}, "outputs": [], "source": [ "# get the information about the PubChemQC-B3LYP dataset configuration/subset\n", "config_info = get_dataset_config_info(path, name, trust_remote_code=True)\n", "\n", "# print the retrieved information about the PubChemQC-B3LYP dataset\n", "print(\"Information about the PubChemQC-B3LYP dataset configuration/subset:\")\n", "pprint(config_info,\n", " indent=4,\n", " width=100,\n", " compact=True)" ] }, { "cell_type": "markdown", "id": "c01684ca", "metadata": {}, "source": [ "Processing a lengthy output is not always convenient. To make\n", "metadata inspection easier, we can access specific attributes of the\n", "`datasets.info.DatasetInfo` instance directly. 
For example, the `description`\n", "attribute provides access to the content of the *description* field in the\n", "dataset configuration, as shown below:" ] }, { "cell_type": "code", "execution_count": null, "id": "d7ca2dff", "metadata": {}, "outputs": [], "source": [ "pprint(config_info.description,\n", " indent=4,\n", " width=100,\n", " compact=True)" ] }, { "cell_type": "markdown", "id": "7564c7f6", "metadata": {}, "source": [ "We can also use the other imported `get_dataset_*` helper functions to access\n", "the metadata directly, without creating a `datasets.info.DatasetInfo` object,\n", "as shown below:" ] }, { "cell_type": "code", "execution_count": null, "id": "f50da911", "metadata": {}, "outputs": [], "source": [ "# get the list of all available dataset configurations/subsets in the PubChemQC-B3LYP dataset\n", "config_names = get_dataset_config_names(path)\n", "\n", "# print the retrieved information about the PubChemQC-B3LYP dataset\n", "print(\"List of available dataset configurations/subsets in the PubChemQC-B3LYP dataset:\")\n", "config_names" ] }, { "cell_type": "code", "execution_count": null, "id": "0a925405", "metadata": {}, "outputs": [], "source": [ "# get the list of all available dataset splits in the PubChemQC-B3LYP dataset\n", "split_names = get_dataset_split_names(path, name)\n", "\n", "# print the retrieved information about the PubChemQC-B3LYP dataset\n", "print(\"List of available dataset splits in the PubChemQC-B3LYP dataset:\")\n", "split_names" ] }, { "cell_type": "code", "execution_count": null, "id": "2534c098", "metadata": {}, "outputs": [], "source": [ "# get the default configuration name of the PubChemQC-B3LYP dataset\n", "default_config_name = get_dataset_default_config_name(path)\n", "\n", "# print the retrieved information about the PubChemQC-B3LYP dataset\n", "print(f\"Default configuration name of the PubChemQC-B3LYP dataset: {default_config_name}\")" ] }, { "cell_type": "markdown", "id": "5e477ae9", "metadata": {}, 
"source": [ "We can also list the available features in our dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "edefb732", "metadata": {}, "outputs": [], "source": [ "list(config_info.features)" ] }, { "cell_type": "markdown", "id": "a6ad4a42", "metadata": {}, "source": [ "Dropping the `list()` call from the command above returns a dictionary-like\n", "object that maps each feature name to its corresponding data type.\n", "\n", "We can also access the citation information using the `citation` attribute,\n", "as shown below:" ] }, { "cell_type": "code", "execution_count": null, "id": "2749b3e7", "metadata": {}, "outputs": [], "source": [ "pprint(config_info.citation,\n", " indent=4,\n", " width=100,\n", " compact=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }