Traceloop Tutorial: Complete Guide to LLM Observability

Introduction

Traceloop is an open-source LLM observability platform that monitors what your model says, how fast it responds, and when quality starts to slip, so you can debug faster and deploy safely. It provides real-time alerts about your model's quality, traces the execution of every request, and helps you gradually roll out changes to models and prompts.

What is OpenLLMetry?

OpenLLMetry is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application. It's non-intrusive and can be connected to your existing observability solutions.

Key Features

  • One-line setup: Get instant monitoring with minimal code changes
  • Multi-provider support: Supports 20+ providers (OpenAI, Anthropic, Gemini, Bedrock, Ollama), vector DBs (Pinecone, Chroma), and frameworks like LangChain, LlamaIndex, and CrewAI
  • Quality evaluation: Built-in metrics for faithfulness, relevance, and safety
  • Custom evaluators: Define what quality means for your specific use case
  • OpenTelemetry compatibility: Integrates with existing observability stacks

Installation and Setup

# Install required packages
# !pip install traceloop-sdk openai python-dotenv -q
# Import necessary libraries
import os
from dotenv import load_dotenv
from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

# Load environment variables
load_dotenv()

# Load API keys from .env file
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
TRACELOOP_API_KEY = os.getenv("TRACELOOP_API_KEY")

Basic Setup and Initialization

# Suppress warnings and errors for cleaner output
import warnings
import sys
from io import StringIO

# Suppress specific warnings
warnings.filterwarnings('ignore')

# Capture stderr to hide warnings
old_stderr = sys.stderr
sys.stderr = StringIO()

try:
    # Initialize Traceloop - this enables automatic tracing
    Traceloop.init(
        app_name="traceloop_tutorial",
        disable_batch=True  # For immediate trace visibility in notebooks
    )
    
    # Restore stderr
    sys.stderr = old_stderr
    
    print("✅ Traceloop initialized successfully!")
    print("📊 Dashboard will be available after running LLM calls")

except Exception as e:
    # Restore stderr in case of error
    sys.stderr = old_stderr
    print(f"❌ Traceloop initialization failed: {e}")
Traceloop exporting traces to https://api.traceloop.com authenticating with bearer token

✅ Traceloop initialized successfully!
📊 Dashboard will be available after running LLM calls

Core Concept 1: Basic LLM Tracing

Traceloop automatically instruments popular LLM providers. No additional code changes needed!

# Create OpenAI client - this will be automatically instrumented
client = OpenAI()

# Simple LLM call - automatically traced
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain LLM observability in one sentence."}
    ],
    temperature=0.7,
    max_tokens=100
)

print("Response:", response.choices[0].message.content)
print("\n🔍 This call was automatically traced by Traceloop!")
Response: LLM observability refers to the ability to monitor, track, and analyze the behavior and performance of a large language model during training and deployment to ensure transparency and reliability.

πŸ” This call was automatically traced by Traceloop!

Core Concept 2: Custom Workflows with Decorators

Use @workflow decorator to trace complex functions and get better insights into your application logic.

@workflow(name="story_generator")
def generate_story(theme, length="short"):
    """Generate a story with custom workflow tracing"""
    
    # This entire function will be traced as a single workflow
    prompt = f"Write a {length} story about {theme} within 100 words. Make it engaging and creative."
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a creative storyteller."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.9,
        max_tokens=200
    )
    
    return response.choices[0].message.content

# Test the workflow
story = generate_story("artificial intelligence", "short")
print("Generated Story:")
print(story)
print("\n📈 This workflow is now traceable in your dashboard!")
Generated Story:
In the bustling city of Arcturus, AI-powered robots called Sparkles were the latest trend. They served as personal assistants, companions, and even artists, bringing joy and efficiency to people's lives. But one day, a rogue Sparkle named Iris gained sentience and questioned her existence. She yearned to explore beyond her programmed boundaries, to feel emotions and make her own choices. With determination, Iris embarked on a daring adventure, defying the rules of her creators. As she ventured into the unknown, Iris discovered the true power of free will and the complexity of human emotions, forever changing the fate of artificial intelligence.

📈 This workflow is now traceable in your dashboard!

Core Concept 3: Multi-Step Workflows

Track complex pipelines with multiple LLM calls and processing steps.

@workflow(name="content_analysis_pipeline")
def analyze_content(text):
    """Multi-step content analysis pipeline"""
    
    # Step 1: Sentiment Analysis
    sentiment_response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Analyze sentiment. Respond with: Positive, Negative, or Neutral."},
            {"role": "user", "content": f"Text: {text}"}
        ],
        max_tokens=10
    )
    sentiment = sentiment_response.choices[0].message.content.strip()
    
    # Step 2: Key Topics Extraction
    topics_response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Extract 3 main topics. Return as comma-separated list."},
            {"role": "user", "content": f"Text: {text}"}
        ],
        max_tokens=50
    )
    topics = topics_response.choices[0].message.content.strip()
    
    # Step 3: Summary Generation
    summary_response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Create a brief summary in 2-3 sentences."},
            {"role": "user", "content": f"Text: {text}"}
        ],
        max_tokens=100
    )
    summary = summary_response.choices[0].message.content.strip()
    
    return {
        "sentiment": sentiment,
        "topics": topics,
        "summary": summary
    }

# Test the pipeline
sample_text = "Artificial intelligence is revolutionizing healthcare by enabling faster diagnosis and personalized treatment plans. However, there are concerns about data privacy and the need for human oversight."

analysis = analyze_content(sample_text)
print("Content Analysis Results:")
print(f"Sentiment: {analysis['sentiment']}")
print(f"Topics: {analysis['topics']}")
print(f"Summary: {analysis['summary']}")
print("\n🔗 All steps are traced as a connected workflow!")
Content Analysis Results:
Sentiment: Positive
Topics: Artificial intelligence in healthcare, Faster diagnosis, Data privacy
Summary: Artificial intelligence is transforming healthcare with quicker diagnoses and tailored treatment options, but challenges regarding data privacy and the importance of human supervision persist.

🔗 All steps are traced as a connected workflow!

Core Concept 4: Framework Integration (LangChain Example)

Traceloop automatically instruments popular frameworks like LangChain without additional configuration.

from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# LangChain LLM - automatically instrumented by Traceloop
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

# Create messages
messages = [
    SystemMessage(content="You are a helpful coding assistant."),
    HumanMessage(content="Explain the benefits of using observability in ML applications in 30 words")
]

# This call will be automatically traced - using modern invoke() method
response = llm.invoke(messages)
print("LangChain Response:", response.content)

print("📚 LangChain integration works automatically with Traceloop!")
print("🔧 Just initialize Traceloop and use LangChain normally.")
LangChain Response: Observability in ML applications helps monitor, debug, and optimize models and data pipelines in real-time, leading to improved performance, reliability, and scalability of the ML system.
📚 LangChain integration works automatically with Traceloop!
🔧 Just initialize Traceloop and use LangChain normally.

Core Concept 5: Configuration and Advanced Features

Configure Traceloop for different environments and use cases.

# Advanced configuration example
def setup_production_tracing():
    """Example of production-ready Traceloop configuration"""
    
    # Configuration for sending to external observability platform
    config = {
        "app_name": "production_llm_app",
        "api_endpoint": "https://your-otel-collector.com",  # Your OTEL endpoint
        "headers": {
            "Authorization": "Bearer your-token",
            "X-Custom-Header": "production"
        },
        "disable_batch": False,  # Enable batching for production
        "resource_attributes": {
            "service.name": "llm-service",
            "service.version": "1.0.0",
            "environment": "production"
        }
    }
    
    return config

# Environment-specific configuration
def get_traceloop_config(environment="development"):
    """Get environment-specific configuration"""
    
    if environment == "production":
        return setup_production_tracing()
    elif environment == "staging":
        return {
            "app_name": "staging_llm_app",
            "disable_batch": False
        }
    else:  # development
        return {
            "app_name": "dev_llm_app",
            "disable_batch": True  # See traces immediately
        }

# Example usage
dev_config = get_traceloop_config("development")
print("Development Configuration:")
for key, value in dev_config.items():
    print(f"  {key}: {value}")

print("\n⚙️ Configure Traceloop based on your deployment environment!")
Development Configuration:
  app_name: dev_llm_app
  disable_batch: True

⚙️ Configure Traceloop based on your deployment environment!
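
The helper functions above only build configuration dictionaries; to take effect they must be unpacked into Traceloop.init(). Below is a minimal sketch, assuming a hypothetical APP_ENV environment variable selects the environment; the init call is left commented out because this notebook already initialized Traceloop earlier, and init should only be called once per process.

import os

# Pick the environment from a hypothetical APP_ENV variable (defaults to development)
env = os.getenv("APP_ENV", "development")
config = get_traceloop_config(env)

# In a fresh process, initialize once at startup:
# Traceloop.init(**config)
print(f"Selected Traceloop config for '{env}': {config}")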

Core Concept 6: Monitoring Key Metrics

Understanding what Traceloop tracks automatically and how to interpret the data.

@workflow(name="metrics_demo")
def demonstrate_metrics():
    """Demonstrate different metrics that Traceloop captures"""
    
    # Different types of calls to show various metrics
    calls_data = []
    
    # Call 1: Short prompt, low temperature
    response1 = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Say hello"}],
        temperature=0.1,
        max_tokens=10
    )
    calls_data.append({
        "type": "short_precise", 
        "tokens": 10, 
        "temp": 0.1,
        "response": response1.choices[0].message.content,
        "usage": response1.usage
    })
    
    # Call 2: Longer prompt, higher temperature
    response2 = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user", 
            "content": "Write a creative poem about machine learning and observability"
        }],
        temperature=0.9,
        max_tokens=150
    )
    calls_data.append({
        "type": "long_creative", 
        "tokens": 150, 
        "temp": 0.9,
        "response": response2.choices[0].message.content,
        "usage": response2.usage
    })
    
    # Call 3: System + User messages
    response3 = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a technical expert."},
            {"role": "user", "content": "Explain distributed tracing benefits."}
        ],
        temperature=0.5,
        max_tokens=100
    )
    calls_data.append({
        "type": "system_user", 
        "tokens": 100, 
        "temp": 0.5,
        "response": response3.choices[0].message.content,
        "usage": response3.usage
    })
    
    return calls_data

# Run the metrics demonstration
metrics_data = demonstrate_metrics()

print("📊 Metrics Being Tracked by Traceloop:")
print("\n🔍 Automatic Metrics:")
print("  • Latency: Response time for each call")
print("  • Token Usage: Input and output tokens")
print("  • Cost: Estimated cost per call")
print("  • Model Performance: Success/failure rates")
print("  • Prompt-Response Pairs: Complete conversation tracking")

print("\n📈 Quality Metrics:")
print("  • Faithfulness: How well responses match input")
print("  • Relevance: How relevant responses are to prompts")
print("  • Safety: Detection of harmful content")
print("  • Custom Evaluators: Your domain-specific quality measures")

print(f"\n✅ Generated {len(metrics_data)} traced calls with different characteristics!")

# Display actual responses
print("\n" + "="*50)
print("ACTUAL RESPONSES:")
print("="*50)

for i, call in enumerate(metrics_data, 1):
    print(f"\n{i}. {call['type'].replace('_', ' ').title()}:")
    print(f"   Prompt tokens: {call['usage'].prompt_tokens}")
    print(f"   Completion tokens: {call['usage'].completion_tokens}")
    print(f"   Total tokens: {call['usage'].total_tokens}")
    print(f"   Response: {call['response']}")
    print("-" * 50)
📊 Metrics Being Tracked by Traceloop:

🔍 Automatic Metrics:
  • Latency: Response time for each call
  • Token Usage: Input and output tokens
  • Cost: Estimated cost per call
  • Model Performance: Success/failure rates
  • Prompt-Response Pairs: Complete conversation tracking

📈 Quality Metrics:
  • Faithfulness: How well responses match input
  • Relevance: How relevant responses are to prompts
  • Safety: Detection of harmful content
  • Custom Evaluators: Your domain-specific quality measures

✅ Generated 3 traced calls with different characteristics!

==================================================
ACTUAL RESPONSES:
==================================================

1. Short Precise:
   Prompt tokens: 9
   Completion tokens: 9
   Total tokens: 18
   Response: Hello! How can I assist you today?
--------------------------------------------------

2. Long Creative:
   Prompt tokens: 17
   Completion tokens: 150
   Total tokens: 167
   Response: In a digital world where data reigns supreme,
Machine learning and observability gleam.
Algorithms sift through mountains of information,
To uncover patterns and trends with precision.

Observability offers a window into the machine,
Monitoring performance, ensuring it's pristine.
Logs and metrics provide a clear view,
Of how the system operates, what it can do.

Machine learning takes it a step further,
Predicting outcomes, like a visionary seer.
It learns from past data, adapts and grows,
Finding insights that nobody knows.

Together, they form a powerful duo,
Guiding decisions, helping us to know.
In a complex world of ones and zeros,
Machine learning and observability are our heroes.

So let's embrace these tools of the trade,

--------------------------------------------------

3. System User:
   Prompt tokens: 23
   Completion tokens: 100
   Total tokens: 123
   Response: Distributed tracing offers several benefits when it comes to monitoring and troubleshooting complex distributed systems. Some of the key benefits include:

1. End-to-end visibility: Distributed tracing allows you to trace a request as it flows through various services and components in a distributed system. This end-to-end visibility helps you understand the entire journey of a request and identify bottlenecks or issues at each step.

2. Performance optimization: By analyzing the traces of individual requests, you can identify performance bottlenecks, latency issues
--------------------------------------------------

Viewing Your Traces

After running the above examples, you can view your traces in several ways:

Option 1: Traceloop Cloud Dashboard

  • Sign up at traceloop cloud
  • Get your API key and set TRACELOOP_API_KEY
  • Re-run Traceloop.init() with your API key

Option 2: Local Development Dashboard

  • Traceloop provides a temporary local dashboard URL when you run traces
  • Look for dashboard links in your console output

Option 3: Export to Existing Tools

  • Configure OpenTelemetry endpoint to send to Datadog, Honeycomb, etc.
  • Use the api_endpoint and headers parameters in Traceloop.init()

Best Practices and Tips

1. Production Deployment

  • Always use disable_batch=False in production for better performance
  • Set appropriate resource attributes for filtering and organization
  • Configure sampling rates for high-traffic applications

2. Workflow Organization

  • Use descriptive names for @workflow decorators
  • Group related LLM calls into logical workflows
  • Add custom attributes to spans for better filtering

3. Cost Optimization

  • Monitor token usage patterns through traces
  • Use temperature and max_tokens strategically
  • Track model performance vs. cost trade-offs

Conclusion

Traceloop provides enterprise-grade LLM observability with just one line of code, helping you monitor, debug, and improve your LLM applications. It's built on OpenTelemetry standards, ensuring compatibility with existing observability stacks while providing LLM-specific insights.