# Working with Metadata

This demo focuses on getting to know the data we are going to work with before
downloading and processing it. Here, we use the
[PubChemQC-B3LYP/6-31G*//PM6
Dataset](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp) (PubChemQC-B3LYP for short)
from the [MolSSI AI Hub](https://huggingface.co/molssiai-hub).


To interact with the data, we first import the necessary libraries.

In [None]:
import datasets  # Hugging Face datasets library
from datasets import (
    get_dataset_config_info,  # Get information about a dataset configuration/subset
    get_dataset_config_names,  # Get the list of names of all dataset configurations/subsets
    get_dataset_split_names,  # Get the list of names of all dataset splits
    get_dataset_default_config_name,  # Get the default configuration name of a dataset
)
from pprint import pprint  # Pretty print

After importing the modules, we set a few variables that will be used throughout
this demo.

In [None]:
# path to the dataset repository on the Hugging Face Hub
path = "molssiai-hub/pubchemqc-b3lyp"

# set the dataset configuration/subset name
name = "b3lyp_pm6"

# set the dataset split
split = "train"

The modules and functions we imported allow us to inspect a wide range of
metadata about our dataset without downloading the data itself to disk. For
example, we can access the list of all available configurations/subsets, the
splits, and the default configuration name of our dataset.

The `get_dataset_config_info` function returns a `datasets.info.DatasetInfo` 
object which contains the metadata of our dataset configuration all in one place.

In [None]:
# get the information about the PubChemQC-B3LYP dataset configuration/subset
config_info = get_dataset_config_info(path, name, trust_remote_code=True)

# print the retrieved information about the PubChemQC-B3LYP dataset
print("Information about the PubChemQC-B3LYP dataset configuration/subset:")
pprint(config_info, indent=4, width=100, compact=True)

Scanning a lengthy output is not always convenient. To make the metadata
inspection easier, we can access specific attributes of the
`datasets.info.DatasetInfo` instance directly. For example, the `description`
attribute returns the content of the *description* field of the dataset
configuration, as shown below:

In [None]:
pprint(config_info.description, indent=4, width=100, compact=True)
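
When only a handful of fields matter, the attribute lookups can be wrapped in a
small helper. The sketch below is one possible approach and is not part of the
`datasets` API: `summarize_info` is a hypothetical helper, and `fake_info` is a
stand-in object with made-up values so that the example runs without network
access. In this demo, `config_info` would be passed instead.

```python
from types import SimpleNamespace


def summarize_info(info, fields=("description", "homepage", "license", "citation")):
    """Collect selected attributes of a metadata object into a dictionary.

    Long strings are truncated to 80 characters; missing attributes map to None.
    """
    summary = {}
    for field in fields:
        value = getattr(info, field, None)  # None if the attribute is absent
        if isinstance(value, str) and len(value) > 80:
            value = value[:77] + "..."
        summary[field] = value
    return summary


# Stand-in for a DatasetInfo instance (all values are placeholders)
fake_info = SimpleNamespace(
    description="A" * 200,
    homepage="https://example.org",
    license="placeholder-license",
)

summary = summarize_info(fake_info)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Using `getattr` with a default keeps the helper from failing when a field is not
populated for a given dataset configuration.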

We can also use the other imported `get_dataset_*` helper functions to access
the metadata directly, without creating a `datasets.info.DatasetInfo` object,
as shown below:

In [None]:
# get the list of all available dataset configurations/subsets in the PubChemQC-B3LYP dataset
config_names = get_dataset_config_names(path)

# print the retrieved information about the PubChemQC-B3LYP dataset
print("List of available dataset configurations/subsets in the PubChemQC-B3LYP dataset:")
config_names

In [None]:
# get the list of all available dataset splits in the PubChemQC-B3LYP dataset
split_names = get_dataset_split_names(path, name)

# print the retrieved information about the PubChemQC-B3LYP dataset
print("List of available dataset splits in the PubChemQC-B3LYP dataset:")
split_names

In [None]:
# get the default configuration name of the PubChemQC-B3LYP dataset
default_config_name = get_dataset_default_config_name(path)

# print the retrieved information about the PubChemQC-B3LYP dataset
print(f"Default configuration name of the PubChemQC-B3LYP dataset: {default_config_name}")
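
The helper functions above can also be combined to build an overview of the
splits available in every configuration. The following is a minimal sketch
under stated assumptions: `splits_per_config` is a hypothetical helper, and the
two `fake_*` functions are offline stubs whose return values are placeholders
taken from the `name` and `split` variables of this demo, not the verified
contents of the repository. In practice, `get_dataset_config_names` and
`get_dataset_split_names` would be passed in their place.

```python
def splits_per_config(path, list_configs, list_splits):
    """Map every configuration name of a dataset to its list of split names.

    list_configs(path) must return the configuration names and
    list_splits(path, config) the split names of one configuration.
    """
    return {config: list_splits(path, config) for config in list_configs(path)}


# Offline stubs standing in for the Hugging Face helper functions
def fake_list_configs(path):
    return ["b3lyp_pm6"]  # placeholder value


def fake_list_splits(path, config):
    return ["train"]  # placeholder value


overview = splits_per_config(
    "molssiai-hub/pubchemqc-b3lyp", fake_list_configs, fake_list_splits
)
print(overview)
```

Passing the lookup functions as arguments keeps the helper independent of the
network, which makes it easy to try out before querying the Hub.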

We can also list the names of the available features in our dataset:

In [None]:
list(config_info.features)

Dropping the `list()` call from the command above returns the features as a
dictionary-like object that maps each feature name to its data type.
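
As an illustration of working with that mapping, the sketch below turns it into
a sorted list of name and type pairs. It is a hedged example: `describe_features`
is a hypothetical helper, the feature names are placeholders, and the stand-in
objects only mimic a `dtype` attribute, which simple value features are assumed
to expose. Features without such an attribute fall back to their class name.

```python
from types import SimpleNamespace


def describe_features(features):
    """Return sorted (feature name, type description) pairs from a mapping."""
    rows = []
    for feature_name, feature in features.items():
        # Use the dtype attribute when present, else the object's class name
        dtype = getattr(feature, "dtype", type(feature).__name__)
        rows.append((feature_name, str(dtype)))
    return sorted(rows)


# Stand-in for config_info.features (names and types are placeholders)
fake_features = {
    "feature_b": ["nested"],  # no dtype attribute, falls back to "list"
    "feature_a": SimpleNamespace(dtype="int64"),
}

rows = describe_features(fake_features)
for feature_name, dtype in rows:
    print(f"{feature_name:<12} {dtype}")
```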

Finally, the `citation` attribute gives access to the citation information,
as shown below:

In [None]:
pprint(config_info.citation, indent=4, width=100, compact=True)