Optimizing for Cost

Strategies to control token usage and reduce API costs

Running LLM agents in production can be expensive if not managed carefully. This guide outlines practical strategies to keep your costs under control without sacrificing the quality of your agent's responses.

1. Choose the Right Model

Not every task requires the most powerful model. Match your model choice to the task complexity.

Model Selection Strategy

from peargent import create_agent
from peargent.models import groq

# Use smaller models for simple tasks
classifier_agent = create_agent(
    name="Classifier",
    description="Classifies user intent",
    persona="You classify user messages into categories: support, sales, or general.",
    model=groq("llama-3.1-8b")  # - Smaller, cheaper model
)

# Use larger models only for complex tasks
reasoning_agent = create_agent(
    name="Reasoner",
    description="Solves complex problems",
    persona="You solve complex reasoning and coding problems step by step.",
    model=groq("llama-3.3-70b-versatile")  # - Larger model when needed
)

Guidelines:

  • Simple tasks (classification, extraction, summarization): Use smaller models (8B parameters)
  • Complex tasks (reasoning, coding, analysis): Use larger models (70B+ parameters)
  • Test different models on your specific use case to find the best cost/quality balance
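
To put these guidelines into practice, you can route each request to the cheapest agent that can handle it. A minimal sketch, reusing reasoning_agent from above; the is_simple_task heuristic is a hypothetical placeholder for your own signal (for example, the classifier agent's output):

from peargent import create_agent
from peargent.models import groq

# Cheap general-purpose agent for simple requests
simple_agent = create_agent(
    name="SimpleAgent",
    description="Handles simple requests",
    persona="You answer simple questions concisely.",
    model=groq("llama-3.1-8b")
)

def is_simple_task(user_input: str) -> bool:
    # Hypothetical heuristic - replace with your own signal
    return len(user_input) < 200 and "code" not in user_input.lower()

def route(user_input: str) -> str:
    # Simple requests go to the 8B model, everything else to the 70B model
    agent = simple_agent if is_simple_task(user_input) else reasoning_agent
    return agent.run(user_input)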

Track Model Costs with Custom Pricing

If you're using custom or local models, add their pricing to track costs accurately:

from peargent.observability import enable_tracing, get_tracer

tracer = enable_tracing()

# Add custom pricing for your model (prices per million tokens)
tracer.add_custom_pricing(  
    model="my-fine-tuned-model",
    prompt_price=1.50,      # $1.50 per million prompt tokens
    completion_price=3.00   # $3.00 per million completion tokens
)

# Now cost tracking works for your custom model
agent = create_agent(
    name="CustomAgent",
    model=my_custom_model,
    persona="You are helpful.",
    tracing=True
)

2. Control Context with History Management

The context window is your biggest cost driver. Every message in the conversation history is re-sent with each request.
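
To see how quickly this adds up, here is a rough back-of-envelope estimate; the average message size and the per-token price are illustrative assumptions, not actual provider pricing:

# Illustrative numbers: ~100 tokens per message, $0.60 per million prompt tokens
TOKENS_PER_MESSAGE = 100
PRICE_PER_MILLION = 0.60

# On turn N, roughly N prior messages are re-sent as context,
# so prompt tokens grow quadratically with conversation length
total_prompt_tokens = sum(turn * TOKENS_PER_MESSAGE for turn in range(1, 51))

print(f"A 50-turn conversation re-sends ~{total_prompt_tokens:,} prompt tokens")
print(f"Estimated prompt cost: ${total_prompt_tokens * PRICE_PER_MILLION / 1_000_000:.4f}")

Capping the history at a fixed number of messages makes this growth linear instead, which is what the strategies below do.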

Configure Automatic Context Management

Use HistoryConfig to automatically manage conversation history:

from peargent import create_agent, HistoryConfig
from peargent.storage import InMemory
from peargent.models import groq

agent = create_agent(
    name="CostOptimizedAgent",
    description="Agent with automatic history management",
    persona="You are a helpful assistant.",
    model=groq("llama-3.3-70b-versatile"),
    history=HistoryConfig(  
        auto_manage_context=True,        # Enable automatic management
        max_context_messages=10,         # Keep only last 10 messages
        strategy="trim_last",            # Remove oldest messages when limit reached
        store=InMemory()
    )
)

Available Context Strategies

Peargent supports 5 context management strategies:

| Strategy | How It Works | Use When | Cost Impact |
| --- | --- | --- | --- |
| "trim_last" | Removes oldest messages | Simple conversations | ✅ Low - fast, no LLM calls |
| "trim_first" | Keeps oldest messages | Important initial context | ✅ Low - fast, no LLM calls |
| "first_last" | Keeps first and last messages | Preserving original context | ✅ Low - fast, no LLM calls |
| "summarize" | Summarizes old messages | Complex conversations | ⚠️ Medium - requires LLM call |
| "smart" | Chooses best strategy automatically | General purpose | ⚠️ Variable - may use LLM |

Example: Trim Strategy (Recommended for Cost)

# Most cost-effective - no LLM calls for management
history=HistoryConfig(
    auto_manage_context=True,
    max_context_messages=10,  # Only keep 10 messages
    strategy="trim_last",     # Drop oldest messages
    store=InMemory()
)

Example: Summarize Strategy (Better Context Retention)

# Uses LLM to summarize old messages - costs more but retains context
history=HistoryConfig(
    auto_manage_context=True,
    max_context_messages=20,
    strategy="summarize",           # - Summarize old messages
    summarize_model=groq("llama-3.1-8b"),  # - Use cheap model for summaries
    store=InMemory()
)

Example: Smart Strategy (Balanced)

# Automatically chooses between trim and summarize
history=HistoryConfig(
    auto_manage_context=True,
    max_context_messages=15,
    strategy="smart",  # - Automatically adapts
    store=InMemory()
)

3. Limit Output Length with max_tokens

Control how much the agent can generate by setting max_tokens in model parameters:

from peargent import create_agent
from peargent.models import groq

# Limit output to reduce costs
agent = create_agent(
    name="BriefAgent",
    description="Gives brief responses",
    persona="You provide concise, brief answers. Maximum 2-3 sentences.",
    model=groq(
        "llama-3.3-70b-versatile",
        parameters={
            "max_tokens": 150,      # - Limit to ~150 tokens output
            "temperature": 0.7
        }
    )
)

response = agent.run("Explain quantum computing")
# Agent cannot generate more than 150 tokens

Guidelines:

  • Short answers: max_tokens=150 (~100 words)
  • Medium answers: max_tokens=500 (~350 words)
  • Long answers: max_tokens=2000 (~1500 words)
  • Code generation: max_tokens=4096 or higher
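
If one application serves several of these response styles, you can turn the guidelines into a per-task configuration. A small sketch; the task labels and the MAX_TOKENS_BY_TASK mapping are hypothetical:

from peargent import create_agent
from peargent.models import groq

# Hypothetical output budgets per task type
MAX_TOKENS_BY_TASK = {
    "short_answer": 150,
    "medium_answer": 500,
    "long_answer": 2000,
    "code_generation": 4096,
}

def make_agent(task_type: str):
    # Build an agent whose output budget matches the task
    return create_agent(
        name=f"{task_type}_agent",
        description=f"Handles {task_type} requests",
        persona="You answer within the expected length for the task.",
        model=groq(
            "llama-3.3-70b-versatile",
            parameters={"max_tokens": MAX_TOKENS_BY_TASK[task_type]}
        )
    )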

Move Examples to Tool Descriptions

Instead of putting examples in the persona, put them in tool descriptions:

from peargent import create_agent, create_tool

def search_database(query: str) -> str:
    # Implementation...
    return "Results found"

agent = create_agent(
    name="ProductAgent",
    persona="You help with product inquiries.",  # Short persona
    model=groq("llama-3.3-70b-versatile"),
    tools=[create_tool(
        name="search_database",
        description="""Searches the product database for matching items.

        Use this tool when users ask about products, inventory, or availability.
        Examples: "Do we have red shirts?" → use this tool with query="red shirts"
                  "Check stock for item #123" → use this tool with query="item 123"
        """,  # - Examples in tool description, not persona
        input_parameters={"query": str},
        call_function=search_database
    )]
)

4. Control Temperature for Deterministic Outputs

For tasks that need deterministic outputs, a lower temperature typically yields shorter, more focused completions, which reduces token usage:

from peargent import create_agent
from peargent.models import groq

# For deterministic tasks (extraction, classification)
extraction_agent = create_agent(
    name="Extractor",
    description="Extracts structured data",
    persona="Extract the requested information exactly as it appears.",
    model=groq(
        "llama-3.3-70b-versatile",
        parameters={
            "temperature": 0.0,  # - Deterministic, shorter outputs
            "max_tokens": 500
        }
    )
)

# For creative tasks (writing, brainstorming)
creative_agent = create_agent(
    name="Writer",
    description="Writes creative content",
    persona="You write engaging, creative content.",
    model=groq(
        "llama-3.3-70b-versatile",
        parameters={
            "temperature": 0.9,  # - More creative, longer outputs
            "max_tokens": 2000
        }
    )
)

5. Monitor Costs with Tracing

You can't optimize what you can't measure. Use Peargent's observability features to track costs.

Enable Cost Tracking

from peargent import create_agent
from peargent.observability import enable_tracing
from peargent.storage import Sqlite
from peargent.models import groq

# Enable tracing with database storage
tracer = enable_tracing(
    store_type=Sqlite(connection_string="sqlite:///./traces.db")
)

agent = create_agent(
    name="TrackedAgent",
    description="Agent with cost tracking",
    persona="You are helpful.",
    model=groq("llama-3.3-70b-versatile"),
    tracing=True  # Enable tracing for this agent
)

# Use the agent
response = agent.run("Hello")

# Check costs
traces = tracer.list_traces()
latest = traces[-1]

print(f"Cost: ${latest.total_cost:.6f}")
print(f"Tokens: {latest.total_tokens}")
print(f"Duration: {latest.duration_ms}ms")

Analyze Cost Patterns

from peargent.observability import get_tracer

tracer = get_tracer()

# Get aggregate statistics
stats = tracer.get_aggregate_stats()  

print(f"Total Traces: {stats['total_traces']}")
print(f"Total Cost: ${stats['total_cost']:.6f}")
print(f"Average Cost per Trace: ${stats['avg_cost_per_trace']:.6f}")
print(f"Total Tokens: {stats['total_tokens']:,}")

# Find expensive operations
traces = tracer.list_traces()
expensive_traces = sorted(traces, key=lambda t: t.total_cost, reverse=True)[:5]

print("\nMost Expensive Operations:")
for trace in expensive_traces:
    print(f"  {trace.agent_name}: ${trace.total_cost:.6f} ({trace.total_tokens} tokens)")

Set Cost Alerts

from peargent.observability import get_tracer

tracer = get_tracer()
MAX_COST_PER_REQUEST = 0.01  # $0.01 limit

# agent is the traced agent from the previous example
user_input = "Explain quantum computing"
for update in agent.stream_observe(user_input):
    if update.is_agent_end:
        if update.cost > MAX_COST_PER_REQUEST:  
            print(f"⚠️ WARNING: Cost ${update.cost:.6f} exceeds limit!")
            # Log alert, notify admins, etc.

Track Costs by User

from peargent.observability import enable_tracing, set_user_id, get_tracer
from peargent.storage import Postgresql

tracer = enable_tracing(
    store_type=Postgresql(connection_string="postgresql://user:pass@localhost/db")
)

# Set user ID before agent runs
set_user_id("user_123")  

agent.run("Hello")

# Get costs for specific user
user_stats = tracer.get_aggregate_stats(user_id="user_123")  
print(f"User 123 total cost: ${user_stats['total_cost']:.6f}")

6. Use Streaming to Show Progress

While streaming doesn't reduce costs, it improves perceived performance, making slower/cheaper models feel faster:

from peargent import create_agent
from peargent.models import groq

# Use cheaper model with streaming
agent = create_agent(
    name="StreamingAgent",
    description="Shows progress immediately",
    persona="You are helpful.",
    model=groq("llama-3.1-8b")  # Cheaper model
)

# Stream response - user sees first token in ~200ms
print("Agent: ", end="", flush=True)
for chunk in agent.stream("Explain AI"):  
    print(chunk, end="", flush=True)

Benefit: Cheaper models feel faster with streaming, reducing pressure to use expensive models.

7. Count Tokens Before Sending

Estimate costs before making expensive calls:

from peargent.observability import get_cost_tracker

tracker = get_cost_tracker()

# Count tokens in your prompt
prompt = "Explain quantum computing in detail..."
token_count = tracker.count_tokens(prompt, model="llama-3.3-70b-versatile")  

print(f"Prompt will use ~{token_count} tokens")

# Estimate cost
estimated_cost = tracker.calculate_cost(  
    prompt_tokens=token_count,
    completion_tokens=500,  # Estimate 500 token response
    model="llama-3.3-70b-versatile"
)

print(f"Estimated cost: ${estimated_cost:.6f}")

# Decide whether to proceed
if estimated_cost > 0.01:
    print("Too expensive! Shortening prompt...")
    # Truncate or summarize prompt

Cost Optimization Checklist

Use this checklist for production deployments:

Model Selection

  • Using smallest viable model for each agent type
  • Tested cost vs quality tradeoff for your use case
  • Custom pricing configured for local/fine-tuned models

Context Management

  • HistoryConfig configured with appropriate strategy
  • max_context_messages set to reasonable limit (10-20)
  • Using "trim_last" for cost-sensitive applications
  • Cheaper model used for summarization if using "summarize" strategy

Output Control

  • max_tokens set based on expected response length
  • Persona/system prompt optimized for brevity
  • Examples moved from persona to tool descriptions
  • Temperature set to 0.0 for deterministic tasks

Monitoring

  • Tracing enabled in production
  • Cost tracking configured with accurate pricing
  • Regular analysis of aggregate statistics
  • Alerts set for expensive operations
  • Per-user cost tracking implemented

Implementation

  • Token counting used for cost estimation
  • Streaming enabled for better UX with cheaper models
  • Cost limits enforced in application logic (see the sketch after this checklist)
  • Regular review of most expensive operations
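
For the cost-limit item above, one option is a pre-flight guard built from the token-counting helpers in section 7. A minimal sketch; the $0.01 budget and the 500-token response estimate are assumptions to tune for your workload:

from peargent.observability import get_cost_tracker

tracker = get_cost_tracker()
MAX_COST_PER_REQUEST = 0.01  # assumed budget

def guarded_run(agent, prompt: str, model: str = "llama-3.3-70b-versatile") -> str:
    # Estimate the cost before sending, assuming a ~500-token response
    prompt_tokens = tracker.count_tokens(prompt, model=model)
    estimated = tracker.calculate_cost(
        prompt_tokens=prompt_tokens,
        completion_tokens=500,
        model=model
    )
    if estimated > MAX_COST_PER_REQUEST:
        raise RuntimeError(f"Estimated cost ${estimated:.6f} exceeds budget")
    return agent.run(prompt)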

Summary

Biggest Cost Savings:

  1. History Management - Use trim_last with max_context_messages=10 (saves 50-80% on tokens)
  2. Model Selection - Use smaller models for simple tasks (saves 50-90% on costs)
  3. Persona Optimization - Short personas (saves 5-10% per request)
  4. max_tokens - Limit output length (saves 20-40% on completion tokens)

Essential Monitoring:

  • Enable tracing in production
  • Track costs per user/session
  • Analyze aggregate statistics weekly
  • Set alerts for expensive operations

Start with history management and model selection for the biggest impact!