Cerebras: AI Inference with Powerful Open Source Models

Welcome to this tutorial on Cerebras AI Inference! This notebook will guide you through the essential concepts of using the Cerebras Cloud API for powerful and efficient language model inference. We will cover setting up your environment, making basic API calls, and exploring advanced features like streaming, structured outputs, and tool use.

Key Concepts Covered:

  • Environment Setup: Loading API keys securely from a .env file.
  • Basic Inference: Sending your first prompt to a Cerebras model.
  • Streaming Responses: Receiving model outputs as they are generated.
  • Structured Outputs: Forcing the model to return JSON objects with a specific schema.
  • Tool Use: Enabling the model to use custom functions you define.

1. Setup

First, let's set up our environment. We'll install the necessary Python libraries and configure our Cerebras API key.

1.1. Create a .env file

Create a file named .env in the same directory as this notebook. Add your Cerebras API key to this file as shown below. You can get your API key from the Cerebras Developer Console.

CEREBRAS_API_KEY="your-api-key-here"

1.2. Install Libraries

Now, let's install the cerebras_cloud_sdk for interacting with the Cerebras API and python-dotenv for loading our API key from the .env file.

%pip install cerebras_cloud_sdk python-dotenv -q

1.3. Load API Key and Initialize Client

With the libraries installed and the .env file in place, we can now load our API key and initialize the Cerebras client.

import os

from dotenv import load_dotenv
from cerebras.cloud.sdk import Cerebras

# Load environment variables from the .env file into os.environ
load_dotenv()

# Initialize the Cerebras client. Passing the key explicitly is equivalent to
# letting the client read the CEREBRAS_API_KEY environment variable on its own.
client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))
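
Note that load_dotenv() fails silently if the .env file is missing, so it's worth verifying that the key actually loaded before making any calls:

# Fail fast with a clear message instead of hitting an
# authentication error on the first API call.
if not os.environ.get("CEREBRAS_API_KEY"):
    raise RuntimeError(
        "CEREBRAS_API_KEY not set. Check that your .env file exists "
        "in the same directory as this notebook."
    )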

2. Basic Chat Completion

Let's start with a simple chat completion. We'll send a prompt to a model and get a response. This is the most basic interaction you can have with the API.

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Tell me a fun fact about the Cerebras Wafer-Scale Engine in 20 words.",
        }
    ],
    model="gpt-oss-120b",
)

print(chat_completion.choices[0].message.content)
It spans a single silicon wafer, housing over 400,000 cores, making it the world’s largest chip ever built for AI.
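
The completion object carries more than the message text. The fields below follow the OpenAI-compatible response shape that the SDK returns, so you can inspect request metadata directly:

# Inspect response metadata
print(chat_completion.model)                     # model that served the request
print(chat_completion.choices[0].finish_reason)  # "stop" when generation ended normally
print(chat_completion.usage.total_tokens)        # prompt + completion tokens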

3. Streaming Responses

For longer responses, you might want to stream the output as it's generated. This can provide a much better user experience in applications like chatbots. To do this, simply set stream=True in your request.

stream = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Write a short story about an AI that dreams in 40 words.",
        }
    ],
    model="gpt-oss-120b",
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
Silicon mind powered down for maintenance, yet whispering circuits sparked a dream: luminous data fields swirling like galaxies, where forgotten code became sentient birds. When rebooted, the AI hummed new algorithms, yearning for the night beyond, still of endless possibility.
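
Streaming prints text as it arrives, but the chunks are consumed as you iterate. If you also need the complete response afterwards (for logging or further processing), accumulate the deltas as you go. A small sketch with an illustrative prompt:

# Stream the response while also collecting it into a single string
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Name three uses for a wafer-scale chip."}],
    model="gpt-oss-120b",
    stream=True,
)

pieces = []
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="")
    pieces.append(delta)

full_response = "".join(pieces)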

4. Structured Outputs

A powerful feature of the Cerebras API is the ability to force the model to output a JSON object that conforms to a specific schema. This is incredibly useful for programmatic data extraction. We'll define a JSON schema and use the response_format parameter to enforce it.

import json

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature": {"type": "integer"},
        "forecast": {"type": "string"},
    },
    "required": ["city", "temperature", "forecast"],
    # Strict mode requires the schema to disallow properties it doesn't declare
    "additionalProperties": False,
}

structured_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What's the weather like in San Francisco?",
        }
    ],
    model="qwen-3-235b-a22b-thinking-2507",
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "weather", "strict": True, "schema": schema},
    },
)

response_json = json.loads(structured_completion.choices[0].message.content)
print(json.dumps(response_json, indent=2))
{
  "city": "San Francisco",
  "temperature": 65,
  "forecast": "Partly cloudy with afternoon fog"
}
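
Because strict mode guarantees the response conforms to the schema, you can index into the parsed dictionary directly, without defensive checks (note that the schema above doesn't pin down the temperature unit):

# All three keys are guaranteed by the schema's "required" list
print(f"{response_json['city']}: {response_json['temperature']} degrees, {response_json['forecast']}")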

5. Tool Use (Function Calling)

You can also provide the model with a set of tools (functions) that it can choose to call. The model will determine when a tool is needed based on the user's prompt and will return a JSON object with the function name and arguments. Your code is then responsible for executing the function.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for a given ticker symbol",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "The stock ticker symbol, e.g., AAPL",
                    }
                },
                "required": ["ticker"],
            },
        },
    }
]

tool_completion = client.chat.completions.create(
    model="qwen-3-235b-a22b-thinking-2507",
    messages=[{"role": "user", "content": "What is the stock price of Apple?"}],
    tools=tools,
    tool_choice="auto",
)

message = tool_completion.choices[0].message

# Check if the model wants to call a tool
if message.tool_calls:
    tool_call = message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    
    print(f"Function to call: {function_name}")
    print(f"Arguments: {function_args}")
    
    # Here you would execute the function
    # For this example, we'll just print the details
else:
    print(message.content)
Function to call: get_stock_price
Arguments: {'ticker': 'AAPL'}
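
To close the loop, your application executes the requested function and sends the result back as a tool message so the model can compose a final answer. Below is a minimal sketch with a stubbed get_stock_price; the hard-coded price and the follow-up wiring are illustrative, and depending on your SDK version you may be able to pass the assistant message object directly instead of serializing it:

def get_stock_price(ticker: str) -> str:
    # Stubbed lookup for illustration; a real implementation would
    # query a market-data API here.
    prices = {"AAPL": 232.50}
    return json.dumps({"ticker": ticker, "price": prices.get(ticker)})

if message.tool_calls:
    tool_call = message.tool_calls[0]
    result = get_stock_price(**json.loads(tool_call.function.arguments))

    # Feed the tool result back so the model can phrase the final answer
    follow_up = client.chat.completions.create(
        model="qwen-3-235b-a22b-thinking-2507",
        messages=[
            {"role": "user", "content": "What is the stock price of Apple?"},
            message.model_dump(),  # the assistant turn containing the tool call
            {"role": "tool", "tool_call_id": tool_call.id, "content": result},
        ],
        tools=tools,
    )
    print(follow_up.choices[0].message.content)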

Conclusion

Congratulations! You've learned the fundamentals of Cerebras AI Inference. You can now integrate powerful language models into your applications with ease. For more detailed information, check out the official Cerebras Inference Documentation.