How to use OpenAI’s GPT OSS

This guide walks you through using OpenAI’s latest GPT OSS models with Hugging Face Inference Providers. GPT OSS is an open-weights model family built for strong reasoning, agentic workflows, and versatile developer use cases. It comes in two sizes: a larger one with 120B parameters (gpt-oss-120b) and a smaller one with 20B parameters (gpt-oss-20b).

Both models are supported on Inference Providers and can be accessed through either the OpenAI-compatible Chat Completions API or the more advanced Responses API.

Quickstart

  1. You’ll need your Hugging Face token. Get one from your settings page, then set it as an environment variable:
export HF_TOKEN="your_token_here"

💡 Pro tip: The free tier gives you monthly inference credits to start building and experimenting. Upgrade to Hugging Face PRO for even more flexibility: $2 in monthly credits plus pay‑as‑you‑go access to all providers!

  2. Install the official OpenAI SDK:
pip install openai

Chat Completions

Getting started with GPT OSS models on Inference Providers is straightforward. The OpenAI-compatible Chat Completions API supports tool calling, structured outputs, streaming, and reasoning effort controls.

Here’s a basic example using gpt-oss-120b through the fast Cerebras provider:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b:cerebras",
    messages=[{"role": "user", "content": "Tell me a fun fact about the Eiffel Tower."}],
)

print(response.choices[0].message.content)

You can also give the model access to tools. Below, we define a get_current_weather function and let the model decide whether to call it:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b:cerebras",
    messages=[{"role": "user", "content": "What is the weather in Paris in Celsius?"}],
    tools=tools,
    tool_choice="auto",
)

# The response will contain the tool_calls object if the model decides to use the tool
print(response.choices[0].message)
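If the model does decide to call the tool, you execute the function yourself and send its result back in a follow-up request so the model can produce a final answer. Here is a minimal sketch of that round trip, continuing from the snippet above (the local get_current_weather implementation is a hypothetical stand-in):

import json

# Continuing from the previous snippet: run the requested tool locally,
# then send the result back so the model can answer in natural language.
message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)

    # Hypothetical local implementation of the tool
    def get_current_weather(location, unit="celsius"):
        return json.dumps({"location": location, "temperature": 22, "unit": unit})

    result = get_current_weather(**args)

    follow_up = client.chat.completions.create(
        model="openai/gpt-oss-120b:cerebras",
        messages=[
            {"role": "user", "content": "What is the weather in Paris in Celsius?"},
            message,  # the assistant message containing the tool call
            {"role": "tool", "tool_call_id": tool_call.id, "content": result},
        ],
        tools=tools,
    )
    print(follow_up.choices[0].message.content)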

For structured tasks like data extraction, you can force the model to return a valid JSON object using the response_format parameter. This example routes the request through the Fireworks AI provider.

import json
import os

from openai import OpenAI


client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

# Force the model to output a JSON object
response = client.chat.completions.create(
    model="openai/gpt-oss-120b:fireworks-ai",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant designed to output JSON.",
        },
        {
            "role": "user",
            "content": "Extract the name, city, and profession from the following sentence: 'Amélie is a chef who lives in Paris.'",
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "city": {"type": "string"},
                    "profession": {"type": "string"},
                },
                "required": ["name", "city", "profession"],
            },
        },
    },
)

# The output is a valid JSON string that can be easily parsed
output_json_string = response.choices[0].message.content
parsed_output = json.loads(output_json_string)

print(parsed_output)
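The same endpoint also supports streaming, so you can print tokens as they are generated instead of waiting for the full reply. A minimal sketch using the standard OpenAI streaming interface (same Cerebras-routed model as in the first example):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

# Set stream=True to receive the answer chunk by chunk
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b:cerebras",
    messages=[{"role": "user", "content": "Write a haiku about open-weight models."}],
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. the final one) may carry no content
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)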

With just a few lines of code, you can start using GPT OSS models through Hugging Face Inference Providers: fully OpenAI API-compatible, easy to integrate, and ready out of the box!

Responses API

Inference Providers implements the OpenAI-compatible Responses API, the most advanced interface for chat-based models. It supports streaming, structured outputs, tool calling, reasoning effort controls (low, medium, high), and Remote MCP calls to delegate tasks to external services.

Key Advantages:

  • Agent-Oriented Design: The API is specifically built to simplify workflows for agentic tasks. It has a native framework for integrating complex tool use, such as Remote MCP calls.
  • Stateful, Event-Driven Architecture: Instead of resending the entire text on every update, the API streams semantic events that describe only the precise change (the “delta”). This eliminates the need for manual state tracking.
  • Simplified Development for Complex Logic: The event-driven model makes it easier to build reliable applications with multi-step logic. Your code simply listens for specific events, leading to cleaner and more robust integrations.

The implementation is based on the open-source huggingface/responses.js project.

Stream responses

Unlike traditional text streaming, the Responses API uses a system of semantic events for streaming. This means the stream is not just raw text, but a series of structured event objects. Each event has a type, so you can listen for the specific events you care about, such as content being added (output_text.delta) or the message being completed (completed). The example below shows how to iterate through these events and print the content as it arrives.

import os
from openai import OpenAI


client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

# Set stream=True to receive a stream of semantic events
stream = client.responses.create(
    model="openai/gpt-oss-120b:fireworks-ai",
    input="Tell me a short story about a robot who discovers music.",
    stream=True,
)

# Iterate over the events in the stream
for event in stream:
    print(event)
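In practice you usually filter on the event type rather than printing every event. A small sketch, assuming the OpenAI-style event names (response.output_text.delta for incremental text, response.completed when the response finishes); it replaces the print loop above:

# Replace the loop above with a type check on each event
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)  # incremental text
    elif event.type == "response.completed":
        print()  # the response has finished streaming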

Tool Calling

You can extend the model with tools to access external data. The example below defines a get_current_weather function that the model can choose to call.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

tools = [
    {
        "type": "function",
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    }
]

response = client.responses.create(
    model="openai/gpt-oss-120b:fireworks-ai", 
    tools=tools,
    input="What is the weather like in Boston today?",
    tool_choice="auto",
)

print(response)
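When the model decides to call the function, the call appears as a function_call item in response.output. You then run the function yourself and send its result back as a function_call_output item so the model can produce a final answer. A rough sketch of that second step, following the OpenAI Responses conventions and continuing from the snippet above (the local weather lookup is a hypothetical stand-in):

import json

# Continuing from the previous snippet: find the function call, run it locally,
# and send the output back to the model.
for item in response.output:
    if item.type == "function_call" and item.name == "get_current_weather":
        args = json.loads(item.arguments)

        # Hypothetical local result for the requested location
        weather = {"location": args["location"], "temperature": 21, "unit": args.get("unit", "celsius")}

        follow_up = client.responses.create(
            model="openai/gpt-oss-120b:fireworks-ai",
            tools=tools,
            input=[
                {"role": "user", "content": "What is the weather like in Boston today?"},
                item,  # the model's function_call item
                {
                    "type": "function_call_output",
                    "call_id": item.call_id,
                    "output": json.dumps(weather),
                },
            ],
        )
        print(follow_up.output_text)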

Remote MCP Calls

The API’s most advanced feature is Remote MCP calls, which allow the model to delegate tasks to external services. Calling a remote MCP server with the Responses API is straightforward. For example, here’s how you can use the DeepWiki MCP server to ask questions about nearly any public GitHub repository.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-120b:fireworks-ai",
    input="What transport protocols are supported in the 2025-03-26 version of the MCP spec?",
    tools=[
        {
            "type": "mcp",
            "server_label": "deepwiki",
            "server_url": "https://mcp.deepwiki.com/mcp",
            "require_approval": "never",
        },
    ],
)

print(response)
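The response output typically contains the intermediate MCP items (such as the tool listing and the actual tool call) alongside the final message. A small sketch for inspecting it, assuming those conventions and the SDK’s output_text convenience property:

# Continuing from the previous snippet: see what the model did step by step
for item in response.output:
    print(item.type)  # e.g. mcp_list_tools, mcp_call, reasoning, message

# Aggregated text of the final answer
print(response.output_text)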

Reasoning Effort

You can also control the model’s “thinking” time with the reasoning parameter. The following example nudges the model to spend only a low amount of effort on the answer.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-120b:fireworks-ai",
    instructions="You are a helpful assistant.",
    input="Say hello to the world.",
    reasoning={
        "effort": "low",
    },
)

for index, item in enumerate(response.output):
    print(f"Output #{index}: {item.type}", item.content)

That’s it! With the Responses API on Inference Providers, you get fine-grained control over powerful open-weight models like GPT OSS, including streaming, tool calling, and remote MCP, making it ideal for building reliable, agent-driven applications.
