# Hugging Face Hub Client library

## Download files from the Hub

The `hf_hub_download()` function is the main function to download files from the
Hub. One advantage of using it is that files are cached locally, so you won't
have to download the files multiple times. If there are changes in the
repository, the files will be automatically downloaded again.
### `hf_hub_download`

The function takes the following parameters, downloads the remote file,
stores it to disk (in a version-aware way) and returns its local file path.

Parameters:
- a `repo_id` (a user or organization name and a repo name, separated by `/`, like `julien-c/EsperBERTo-small`)
- a `filename` (like `pytorch_model.bin`)
- an optional Git revision id (can be a branch name, a tag, or a commit hash)
- a `cache_dir` which you can specify if you want to control where on disk the
  files are cached
```python
from huggingface_hub import hf_hub_download

hf_hub_download("lysandre/arxiv-nlp", filename="config.json")
```
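You can also pin the download to a revision and choose the cache location (a minimal sketch; the revision value and cache folder are illustrative):

```python
from huggingface_hub import hf_hub_download

# Pin the download to a branch, tag or commit hash, and choose where files are cached
hf_hub_download(
    "lysandre/arxiv-nlp",
    filename="config.json",
    revision="main",
    cache_dir="./my-cache",
)
```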
### `snapshot_download`

Using `hf_hub_download()` works well when you know which files you want to download;
for example a model file alongside a configuration file, both with static names.
There are cases in which you will prefer to download all the files of the remote
repository at a specified revision. That's what `snapshot_download()` does. It
downloads and stores a remote repository to disk (in a version-aware way) and
returns the path of the local folder.

Parameters:
- a `repo_id` in the format `namespace/repository`
- a `revision` at which the repository will be downloaded
- a `cache_dir` which you can specify if you want to control where on disk the
  files are cached
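For example (a minimal sketch; the repo id is the one used above):

```python
from huggingface_hub import snapshot_download

# Download the whole repository at a given revision and return the local folder path
snapshot_download("lysandre/arxiv-nlp", revision="main")
```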
### `hf_hub_url`

Internally, the library uses `hf_hub_url()` to return the URL to download the actual files:
`https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin`

Parameters:
- a `repo_id` (a user or organization name and a repo name separated by a `/`, like `julien-c/EsperBERTo-small`)
- a `filename` (like `pytorch_model.bin`)
- an optional `subfolder`, corresponding to a folder inside the model repo
- an optional `repo_type`, such as `dataset` or `space`
- an optional Git revision id (can be a branch name, a tag, or a commit hash)
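For example (a minimal sketch reproducing the URL shown above):

```python
from huggingface_hub import hf_hub_url

hf_hub_url("julien-c/EsperBERTo-small", filename="pytorch_model.bin")
# 'https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin'
```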
If you check out this URL's headers with a `HEAD` HTTP request (which you can do
from the command line with `curl -I`) for a few different files, you'll see
that:
- small files are returned directly
- large files (i.e. the ones stored through
  [git-lfs](https://git-lfs.github.com/)) are returned via a redirect to a
  Cloudfront URL. Cloudfront is a Content Delivery Network, or CDN, that ensures
  that downloads are as fast as possible from anywhere on the globe.
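You can observe the same behavior from Python (a sketch using the `requests` library; the status code and redirect target in the comments are indicative):

```python
import requests
from huggingface_hub import hf_hub_url

url = hf_hub_url("julien-c/EsperBERTo-small", filename="pytorch_model.bin")
response = requests.head(url, allow_redirects=False)

print(response.status_code)              # e.g. 302 for a git-lfs file
print(response.headers.get("Location"))  # the CDN URL the download redirects to
```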
<br>
## Publish files to the Hub

If you've used Git before, this will be very easy since Git is used to manage
files in the Hub. You can find a step-by-step guide on how to upload your model
to the Hub: https://huggingface.co/docs/hub/adding-a-model.
### API utilities in `hf_api.py`

You don't need them for the standard publishing workflow (i.e. using the git
command line); however, if you need a programmatic way of creating a repo,
deleting it (`⚠️ caution`), pushing a single file to a repo or listing models
from the Hub, you'll find helpers in `hf_api.py` (see the sketch after this
list). Some example functionality available with the `HfApi` class:

* `whoami()`
* `create_repo()`
* `list_repo_files()`
* `list_repo_objects()`
* `delete_repo()`
* `update_repo_visibility()`
* `create_commit()`
* `upload_file()`
* `delete_file()`
* `delete_folder()`
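For instance, here is a minimal sketch of creating a repo and pushing a single file to it. It assumes you are logged in via `huggingface-cli login`; the repo id and file name are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()

# Create a new model repository under your namespace
api.create_repo(repo_id="<user>/<model_id>")

# Push a single file to the repository
api.upload_file(
    path_or_fileobj="pytorch_model.bin",
    path_in_repo="pytorch_model.bin",
    repo_id="<user>/<model_id>",
)
```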
Those API utilities are also exposed through the `huggingface-cli` CLI:

```bash
huggingface-cli login
huggingface-cli logout
huggingface-cli whoami
huggingface-cli repo create
```
With the `HfApi` class there are methods to query models, datasets, and Spaces by specific tags (e.g. if you want to list models compatible with your library), as sketched below:

- **Models**:
  - `list_models()`
  - `model_info()`
  - `get_model_tags()`
- **Datasets**:
  - `list_datasets()`
  - `dataset_info()`
  - `get_dataset_tags()`
- **Spaces**:
  - `list_spaces()`
  - `space_info()`

These lightly wrap around the API Endpoints. Documentation for valid parameters and descriptions can be found [here](https://huggingface.co/docs/hub/endpoints).
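For example (a minimal sketch; the search terms are illustrative):

```python
from huggingface_hub import HfApi

api = HfApi()

# List models matching a search query
models = api.list_models(search="bert")

# Fetch detailed metadata for a single model
info = api.model_info("bert-base-uncased")
```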
### Advanced programmatic repository management

The `Repository` class helps manage both offline Git repositories and Hugging
Face Hub repositories. Using the `Repository` class requires `git` and `git-lfs`
to be installed.
Instantiate a `Repository` object by calling it with a path to a local Git
clone/repository:

```python
>>> from huggingface_hub import Repository
>>> repo = Repository("<path>/<to>/<folder>")
```
The `Repository` class takes an optional `clone_from` string as a parameter.
This can stay as `None` for offline management, but can also be set to any URL
pointing to a Git repo to clone that repository into the specified directory:

```python
>>> repo = Repository("huggingface-hub", clone_from="https://github.com/huggingface/huggingface_hub")
```
The `clone_from` parameter can also take any Hugging Face model ID as input, and
will clone that repository:

```python
>>> repo = Repository("w2v2", clone_from="facebook/wav2vec2-large-960h-lv60")
```
If the repository you're cloning is one of yours or one of your organisation's,
then having the ability to commit and push to that repository is important. In
order to do that, you should make sure to be logged in using `huggingface-cli
login`, and to have the `token` parameter set to `True` (the default)
when instantiating the `Repository` object:

```python
>>> repo = Repository("my-model", clone_from="<user>/<model_id>", token=True)
```
This works for model, dataset and space repositories; but you will need to
explicitly specify the type for the latter two:

```python
>>> repo = Repository("my-dataset", clone_from="<user>/<dataset_id>", token=True, repo_type="dataset")
```
You can also switch between branches:

```python
>>> repo = Repository("huggingface-hub", clone_from="<user>/<model_id>", revision="branch1")
>>> repo.git_checkout("branch2")
```
Finally, you can choose to specify the Git username and email attributed to that
clone directly by using the `git_user` and `git_email` parameters. When
committing to that repository, Git will therefore be aware of who you are and
who will be the author of the commits:

```python
>>> repo = Repository(
...     "my-dataset",
...     clone_from="<user>/<dataset_id>",
...     token=True,
...     repo_type="dataset",
...     git_user="MyName",
...     git_email="[email protected]"
... )
```
The repository can be managed through this object, through wrappers of
traditional Git methods:

- `git_add(pattern: str, auto_lfs_track: bool)`. The `auto_lfs_track` flag
  triggers auto tracking of large files (>10MB) with `git-lfs`.
- `git_commit(commit_message: str)`
- `git_pull(rebase: bool)`
- `git_push()`
- `git_checkout(branch)`

The `git_push` method has a parameter `blocking` which is `True` by default. When set to `False`, the push will
happen in the background, which can be helpful if you would like your script to continue on while the push is
happening.
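For instance, a typical add/commit/push sequence looks like this (a sketch reusing the `repo` object from above):

```python
# Stage everything, auto-tracking large files with git-lfs
repo.git_add(pattern=".", auto_lfs_track=True)
repo.git_commit("Add my awesome files")
repo.git_push()
```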
LFS-tracking methods:

- `lfs_track(pattern: Union[str, List[str]], filename: bool)`. Setting
  `filename` to `True` will use the `--filename` parameter, which will consider
  the pattern(s) as filenames, even if they contain special glob characters.
- `lfs_untrack()`
- `auto_track_large_files()`: automatically tracks files that are larger than
  10MB. Make sure to call this after adding files to the index.
On top of these unitary methods lie some useful additional methods:

- `push_to_hub(commit_message)`: consecutively does `git_add`, `git_commit` and
  `git_push`.
- `commit(commit_message: str, track_large_files: bool)`: this is a context
  manager utility that handles committing to a repository. This automatically
  tracks large files (>10MB) with `git-lfs`. The `track_large_files` argument can
  be set to `False` if you wish to ignore that behavior.

These two methods also have support for the `blocking` parameter.
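For example, `push_to_hub` turns the three-step sequence shown earlier into a single call (a sketch; the commit message is illustrative):

```python
# Add, commit and push in one call
repo.push_to_hub(commit_message="Update my awesome files")
```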
Examples using the `commit` context manager:

```python
>>> import json
>>> with Repository("text-files", clone_from="<user>/text-files", token=True).commit("My first file :)"):
...     with open("file.txt", "w+") as f:
...         f.write(json.dumps({"hey": 8}))
```
```python
>>> import torch
>>> model = torch.nn.Transformer()
>>> with Repository("torch-model", clone_from="<user>/torch-model", token=True).commit("My cool model :)"):
...     torch.save(model.state_dict(), "model.pt")
```
### Non-blocking behavior

The pushing methods have access to a `blocking` boolean parameter to indicate whether the push should happen
asynchronously.

In order to see if the push has finished or its status code (to spot a failure), one should use the `command_queue`
property on the `Repository` object.

For example:
```python
from huggingface_hub import Repository

repo = Repository("<local_folder>", clone_from="<user>/<model_name>")

with repo.commit("Commit message", blocking=False):
    # Save data here, e.g. write files into the local folder
    ...

last_command = repo.command_queue[-1]

# Status of the push command
last_command.status
# Will return the status code
#   -> -1 will indicate the push is still ongoing
#   -> 0 will indicate the push has completed successfully
#   -> non-zero code indicates the error code if there was an error

# If there was an error, the stderr may be inspected
last_command.stderr

# Whether the command finished or if it is still ongoing
last_command.is_done

# Whether the command errored out
last_command.failed
```
When using `blocking=False`, the commands will be tracked and your script will exit only when all pushes are done, even
if other errors happen in your script (a failed push counts as done).
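If you want to block explicitly at some point until every queued push has finished, the `Repository` object exposes a helper for that (a sketch; `wait_for_commands()` is available in recent versions of the library):

```python
# Block until all queued push commands have finished
repo.wait_for_commands()
```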
### Need to upload very large (>5GB) files?

To upload large files (>5GB 🔥) from the git command line, you need to install the custom transfer agent
for git-lfs, bundled in this package.

To install, just run:

```bash
$ huggingface-cli lfs-enable-largefiles
```

This should be executed once for each model repo that contains a model file
>5GB. If you just try to push a file bigger than 5GB without running that
command, you will get an error with a message reminding you to run it.

Finally, there's a `huggingface-cli lfs-multipart-upload` command but that one
is internal (called by lfs directly) and is not meant to be called by the user.
<br>
## Using the Inference API wrapper

`huggingface_hub` comes with a wrapper client to make calls to the Inference
API! You can find some examples below, but we encourage you to visit the
Inference API
[documentation](https://api-inference.huggingface.co/docs/python/html/detailed_parameters.html)
to review the specific parameters for the different tasks.

When you instantiate the wrapper to the Inference API, you specify the model
repository id. The pipeline (`text-classification`, `text-to-speech`, etc.) is
automatically extracted from the
[repository](https://huggingface.co/docs/hub/main#how-is-a-models-type-of-inference-api-and-widget-determined),
but you can also override it as shown below.
### Examples

Here is a basic example of calling the Inference API for a `fill-mask` task
using the `bert-base-uncased` model. The `fill-mask` task only expects a string
(or list of strings) as input.

```python
from huggingface_hub.inference_api import InferenceApi

inference = InferenceApi("bert-base-uncased", token=API_TOKEN)
inference(inputs="The goal of life is [MASK].")
>> [{'sequence': 'the goal of life is life.', 'score': 0.10933292657136917, 'token': 2166, 'token_str': 'life'}]
```
This is an example of a task (`question-answering`) which requires a dictionary
as input that has the `question` and `context` keys.

```python
inference = InferenceApi("deepset/roberta-base-squad2", token=API_TOKEN)
inputs = {"question": "What's my name?", "context": "My name is Clara and I live in Berkeley."}
inference(inputs)
>> {'score': 0.9326569437980652, 'start': 11, 'end': 16, 'answer': 'Clara'}
```
Some tasks might also require additional params in the request. Here is an
example using a `zero-shot-classification` model.

```python
inference = InferenceApi("typeform/distilbert-base-uncased-mnli", token=API_TOKEN)
inputs = "Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!"
params = {"candidate_labels": ["refund", "legal", "faq"]}
inference(inputs, params)
>> {'sequence': 'Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!', 'labels': ['refund', 'faq', 'legal'], 'scores': [0.9378499388694763, 0.04914155602455139, 0.013008488342165947]}
```
Finally, there are some models that might support multiple tasks. For example,
`sentence-transformers` models can do `sentence-similarity` and
`feature-extraction`. You can override the configured task when initializing the
API:

```python
inference = InferenceApi("bert-base-uncased", task="feature-extraction", token=API_TOKEN)
```