Context Windows: Why Size Matters (and How to Optimise)

Difficulty: Beginner
Category: Concepts

A developer recently fed an entire 50-page legal document into GPT-4, only to discover the model “forgot” the first 20 pages when answering questions about the conclusion. This isn’t a bug—it’s a fundamental constraint called the context window. As of March 2026, understanding context windows can mean the difference between a $0.50 API call and a $15 one, while dramatically improving response quality.

Prerequisites

Before diving in, you’ll need:

  • Access to at least one LLM API (OpenAI, Anthropic, or Google’s Gemini)
  • Basic Python knowledge (or ability to run API calls)
  • A text editor and terminal
  • Understanding that 1 token ≈ 0.75 English words

What Is a Context Window?

Think of a context window as your AI model’s “working memory.” Just like you can’t hold an entire encyclopedia in your head while having a conversation, language models have hard limits on how much text they can process in a single request.

Current context window sizes (March 2026):

  • GPT-4 Turbo: 128,000 tokens (~96,000 words)
  • Claude 3.5 Sonnet: 200,000 tokens (~150,000 words)
  • Gemini 1.5 Pro: 1,000,000 tokens (~750,000 words)
  • GPT-3.5 Turbo: 16,385 tokens (~12,000 words)

Each token counts toward your limit—including your system prompt, conversation history, uploaded documents, and the model’s response.
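To make the budget concrete, here is a small sketch of the arithmetic. The token figures are made-up illustrative values, and `remaining_budget` is a hypothetical helper, not part of any SDK.

```python
def remaining_budget(window, system=0, history=0, documents=0, reserved_output=0):
    """Return how many tokens are left after every part of the request is counted."""
    used = system + history + documents + reserved_output
    return window - used

# Illustrative numbers only: a 128K window with a 400-token system prompt,
# 6,000 tokens of chat history, a 90,000-token document, and 4,000 tokens
# held back for the model's reply.
left = remaining_budget(128_000, system=400, history=6_000,
                        documents=90_000, reserved_output=4_000)
print(left)  # 27600
```

A negative result means the request will not fit and something has to be trimmed, chunked, or compressed.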

Step 1: Calculate Your Actual Token Usage

Let’s measure exactly what you’re sending to an API.

Install the token counter:

pip install tiktoken

Count tokens in your prompt:

import tiktoken

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

# Real example: A support ticket
ticket = """
Customer: John Smith (ID: 847392)
Issue: Payment failed three times using card ending 4242
Previous interactions: 5 tickets, last resolved 2024-12-15
Account history: Premium member since 2023, $2,340 lifetime value
"""

print(f"Tokens used: {count_tokens(ticket)}")
# Output: Tokens used: 67

Gotcha: Many developers forget that system prompts count too. A typical system prompt uses 200-500 tokens, reducing your available context for actual data.
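A rough way to account for the system prompt is to count every message in the request, not just the user text. The sketch below takes the counting function as a parameter, so it works with count_tokens from Step 1 or any other tokenizer. The per-message overhead of 3 tokens is an assumption: chat formats add a few framing tokens per message, and the exact number varies by model.

```python
def count_message_tokens(messages, count, overhead_per_message=3):
    """Estimate the prompt size of a chat request.

    messages: list of {"role": ..., "content": ...} dicts
    count: any callable that maps a string to a token count
    overhead_per_message: assumed framing cost per message (model-dependent)
    """
    total = 0
    for msg in messages:
        total += count(msg["content"]) + overhead_per_message
    return total

# Stand-in counter for demonstration: roughly 1 token per 0.75 words.
def rough_count(text):
    return round(len(text.split()) / 0.75)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why did my payment fail?"},
]
print(count_message_tokens(messages, rough_count))
```

Swapping rough_count for a real tokenizer gives a tighter estimate; the structure of the calculation stays the same.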

Step 2: Monitor Context Window Exhaustion

When a request exceeds the context window, the API typically rejects it with an error, while wrappers that trim input for you can drop content silently. Here’s how to catch this before sending:

from openai import OpenAI

client = OpenAI(api_key="your-key-here")

def safe_completion(messages, max_context=128000):
    total_tokens = sum(count_tokens(msg["content"]) for msg in messages)
    
    if total_tokens > max_context * 0.9:  # 90% threshold
        print(f"⚠️  Warning: Using {total_tokens} of {max_context} tokens")
        return None
    
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=messages
    )
    
    return response.choices[0].message.content

# Example: a deliberately oversized prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's in this 50-page document?\n\n" + "lorem ipsum " * 100000}
]

result = safe_completion(messages)
# Prints the ⚠️ warning and returns None: the padded prompt is well over
# 90% of the 128,000-token window (the exact count depends on the tokenizer)

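Pro Tip: OpenAI’s API response includes usage metadata. Log it inside safe_completion, where response is in scope, to track actual consumption. Input and output tokens are usually billed at different rates, so price them separately:

usage = response.usage
# Placeholder rates per 1K tokens (input, then output); check current pricing
cost = usage.prompt_tokens / 1000 * 0.01 + usage.completion_tokens / 1000 * 0.03
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Estimated cost: ${cost:.4f}")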
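To track consumption across several calls rather than one, the figures can be accumulated. The sketch below uses a hypothetical UsageLog helper; the rates are placeholders supplied by the caller (dollars per 1,000 tokens), not quoted prices.

```python
class UsageLog:
    """Accumulate token usage and estimated cost across several API calls."""

    def __init__(self, input_rate_per_1k, output_rate_per_1k):
        self.input_rate = input_rate_per_1k
        self.output_rate = output_rate_per_1k
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def record(self, prompt_tokens, completion_tokens):
        """Add the usage figures from one response."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    def cost(self):
        """Estimated spend so far, pricing input and output separately."""
        return ((self.prompt_tokens / 1000) * self.input_rate
                + (self.completion_tokens / 1000) * self.output_rate)

# Placeholder rates; substitute your provider's current prices.
log = UsageLog(input_rate_per_1k=0.01, output_rate_per_1k=0.03)
log.record(prompt_tokens=1200, completion_tokens=300)
log.record(prompt_tokens=800, completion_tokens=200)
print(f"Estimated cost so far: ${log.cost():.4f}")
```

With the OpenAI client, you would call log.record(response.usage.prompt_tokens, response.usage.completion_tokens) after each request.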

Step 3: Implement Smart Chunking for Large Documents

When your input exceeds the context window, chunk it intelligently:

def chunk_document(text, chunk_size=10000, overlap=200):
    """Split text into overlapping chunks; sizes are in words, not tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    
    return chunks

# Real example: Processing a research paper
with open('research_paper.txt') as f:  # 30,000 words
    long_paper = f.read()
chunks = chunk_document(long_paper, chunk_size=8000, overlap=500)

summaries = []
for i, chunk in enumerate(chunks):
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=[
            {"role": "system", "content": "Summarize this section in 100 words."},
            {"role": "user", "content": chunk}
        ]
    )
    summaries.append(response.choices[0].message.content)
    print(f"Processed chunk {i+1}/{len(chunks)}")

# Then combine summaries
final_summary = "\n\n".join(summaries)

Gotcha: A splitter that counts words, like the one above, can cut mid-sentence or mid-paragraph. Where possible, split on natural breakpoints such as double newlines or section headers.
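A minimal sketch of paragraph-aware chunking follows. It assumes paragraphs are separated by blank lines and measures size in words, like the splitter above; chunk_by_paragraph is a hypothetical helper, and it adds no overlap between chunks.

```python
def chunk_by_paragraph(text, max_words=8000):
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_words = [], [], 0

    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "First paragraph here.\n\nSecond one is here too.\n\nThird."
print(chunk_by_paragraph(sample, max_words=8))
```

With max_words=8 the first two paragraphs (3 and 5 words) share a chunk and the third starts a new one, so no paragraph is ever split.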

Step 4: Use Prompt Compression Techniques

Rewriting verbose prose as terse, structured fields can cut token usage substantially (about 58% in the example below) while keeping the facts:

Before compression (427 tokens):

The customer, whose name is Jennifer Martinez and whose customer ID is 
CM-99234, contacted our support team on March 3rd, 2026 at approximately 
2:30 PM Eastern Standard Time regarding an issue she was experiencing...

After compression (178 tokens):

Customer: Jennifer Martinez (CM-99234)
Contact: 2026-03-03 14:30 EST
Issue: Payment processing failure
Details:
- Card ending 5678
- Error code: DECLINED_CVV
- Retry attempts: 3

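Pro Tip: When feeding data to models, prefer terse structured formats (bullet points, YAML, or compact JSON) over prose. In the example above, the structured version uses about 2.4x fewer tokens; measure your own data, since results vary with the format and the tokenizer.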
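If the data already lives in a dictionary, the compact layout can be generated rather than written by hand. The sketch below assumes a flat record whose values are strings or lists of strings; to_compact is a hypothetical helper name.

```python
def to_compact(record):
    """Render a flat dict as 'Key: value' lines; lists become '- item' bullets."""
    lines = []
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)

record = {
    "Customer": "Jennifer Martinez (CM-99234)",
    "Contact": "2026-03-03 14:30 EST",
    "Issue": "Payment processing failure",
    "Details": ["Card ending 5678", "Error code: DECLINED_CVV", "Retry attempts: 3"],
}
print(to_compact(record))
```

The printed text matches the compressed ticket shown in this step.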

Step 5: Leverage Rolling Context Windows

For chat applications, implement a sliding window that keeps recent context:

class ContextManager:
    def __init__(self, max_tokens=16000):
        self.max_tokens = max_tokens
        self.messages = []
    
    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._trim_context()
    
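    def _trim_context(self):
        """Keep the system message plus the most recent messages that fit."""
        if len(self.messages) <= 1:
            return
        
        # Assumes messages[0] is the system message; it always counts toward the budget
        total = count_tokens(self.messages[0]["content"])
        keep_from = 1  # default: every message after the system message fits
        
        # Walk backwards from the newest message
        for i in range(len(self.messages) - 1, 0, -1):
            msg_tokens = count_tokens(self.messages[i]["content"])
            if total + msg_tokens > self.max_tokens:
                keep_from = i + 1  # message i and everything older is dropped
                break
            total += msg_tokens
        
        # Keep system message + recent messages
        self.messages = [self.messages[0]] + self.messages[keep_from:]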
    
    def get_messages(self):
        return self.messages

# Usage
chat = ContextManager(max_tokens=4000)
chat.add_message("system", "You are a helpful assistant.")
chat.add_message("user", "Tell me about Python.")
# ... conversation continues ...
# Old messages automatically dropped when limit reached
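The trimming logic can be checked without a tokenizer by passing the counter in as a parameter. The sketch below is a hypothetical standalone function (trim_messages and word_count are illustrative names, not part of any SDK), and it assumes the first message is the system message.

```python
def trim_messages(messages, max_tokens, count):
    """Return the first message plus the newest messages that fit in max_tokens.

    Assumes messages[0] is the system message and should always be kept.
    count: any callable that maps a string to a token count.
    """
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count(system["content"])

    kept = []
    for msg in reversed(rest):          # newest first
        cost = count(msg["content"])
        if cost > budget:
            break                       # everything older is dropped too
        kept.append(msg)
        budget -= cost
    return [system] + kept[::-1]        # restore chronological order

# Demo with a stand-in counter (one "token" per word)
word_count = lambda text: len(text.split())
history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six"},
    {"role": "user", "content": "seven eight nine"},
]
print([m["content"] for m in trim_messages(history, 8, word_count)])
```

With a budget of 8 "tokens", the system message and the two newest messages survive, and the oldest user message is dropped.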

Step 6: Choose the Right Model for Your Context Needs

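Decision matrix (the prices are flat per-1K-token figures for illustration; providers bill input and output tokens at different rates and change prices often, so check the current pricing pages):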

  • < 4K tokens: GPT-3.5 Turbo ($0.0015/1K) — cheapest option
  • 4K-16K tokens: GPT-4 Turbo ($0.03/1K) — balanced performance
  • 16K-128K tokens: Claude 3.5 Sonnet ($0.015/1K) — cost-effective for large context
  • 128K-1M tokens: Gemini 1.5 Pro ($0.007/1K) — massive documents
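The matrix can be expressed as a small lookup. This is a sketch of the tiers exactly as listed; treating each upper bound as inclusive is an assumption, and pick_model is a hypothetical helper.

```python
# Upper bound of each tier (in tokens) mapped to the model suggested above.
TIERS = [
    (4_000, "GPT-3.5 Turbo"),
    (16_000, "GPT-4 Turbo"),
    (128_000, "Claude 3.5 Sonnet"),
    (1_000_000, "Gemini 1.5 Pro"),
]

def pick_model(total_tokens):
    """Return the model for the first tier whose upper bound covers total_tokens."""
    for limit, model in TIERS:
        if total_tokens <= limit:
            return model
    raise ValueError(f"{total_tokens} tokens exceeds every tier; chunk the input first")

print(pick_model(3_000))
print(pick_model(50_000))
```

Remember to pass the full request size (system prompt, history, document, and reserved output), not just the document length.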

Cost comparison for a 100K-token input at the flat rates listed above:

  • Gemini 1.5 Pro: $0.70
  • Claude 3.5 Sonnet: $1.50
  • GPT-4 Turbo: $3.00
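The figures above are simple arithmetic: tokens divided by 1,000, times the per-1K rate. A minimal sketch, using the flat rates exactly as listed in this step (they are the article's figures, not live prices):

```python
# Flat per-1K-token rates exactly as listed in the decision matrix.
RATES_PER_1K = {
    "Gemini 1.5 Pro": 0.007,
    "Claude 3.5 Sonnet": 0.015,
    "GPT-4 Turbo": 0.03,
}

def input_cost(tokens, rate_per_1k):
    """Cost in dollars for a given number of tokens at a per-1K rate."""
    return tokens / 1000 * rate_per_1k

for model, rate in RATES_PER_1K.items():
    print(f"{model}: ${input_cost(100_000, rate):.2f}")
```

Swapping in current input and output rates for each provider turns this into a quick budgeting check.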

Practical Example: Optimized Document Q&A System

Here’s a complete system that sends small documents directly and falls back to a two-pass chunked approach for large ones:

import tiktoken
from openai import OpenAI

class SmartDocumentQA:
    def __init__(self, model="gpt-4-turbo-2024-04-09", max_context=100000):
        self.client = OpenAI()
        self.model = model
        self.max_context = max_context
        self.encoding = tiktoken.encoding_for_model(model)
    
    def ask(self, document, question):
        # Count tokens
        doc_tokens = len(self.encoding.encode(document))
        question_tokens = len(self.encoding.encode(question))
        system_tokens = 50  # estimated
        
        total = doc_tokens + question_tokens + system_tokens
        
        if total < self.max_context * 0.9:
            # Fits in context - send directly
            return self._direct_query(document, question)
        else:
            # Too large - use chunked approach
            return self._chunked_query(document, question)
    
    def _direct_query(self, document, question):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Answer questions about the provided document."},
                {"role": "user", "content": f"Document:\n{document}\n\nQuestion: {question}"}
            ]
        )
        return response.choices[0].message.content
    
    def _chunked_query(self, document, question):
        # First pass: ask a cheaper model which chunks are relevant
        chunks = self._chunk_text(document, 8000)  # 8,000 words per chunk
        relevant_chunks = []
        
        for chunk in chunks[:10]:  # Demo limit: chunks after the tenth are never inspected
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",  # Use cheaper model for filtering
                messages=[
                    {"role": "user", "content": f"Is this text relevant to '{question}'? Answer YES or NO.\n\n{chunk}"}
                ]
            )
            if "YES" in response.choices[0].message.content.upper():
                relevant_chunks.append(chunk)
        
        # Nothing matched: say so instead of querying with an empty document
        if not relevant_chunks:
            return "No relevant sections found in the inspected chunks."
        
        # Second pass: Answer using only relevant chunks
        combined = "\n\n---\n\n".join(relevant_chunks)
        return self._direct_query(combined, question)
    
    def _chunk_text(self, text, size):
        words = text.split()
        return [' '.join(words[i:i+size]) for i in range(0, len(words), size)]

# Real usage
qa = SmartDocumentQA()
with open('company_handbook.txt') as f:  # 80,000 words
    document = f.read()
answer = qa.ask(document, "What is the remote work policy?")
print(answer)
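The filtering pass relies on a substring check for YES, which also matches a reply such as "No, it only mentions eyes" (the letters Y-E-S appear inside "eyes"). A stricter parse, sketched below with a hypothetical is_relevant helper, looks only at the first word of the reply.

```python
def is_relevant(reply):
    """Interpret a YES/NO reply; anything that does not start with YES counts as no."""
    words = reply.strip().upper().split()
    if not words:
        return False
    # Strip trailing punctuation such as "Yes." or "YES," before comparing
    return words[0].strip(".,!:;\"'") == "YES"

print(is_relevant("Yes."))
print(is_relevant("No, it only mentions eyes"))
```

Inside the filtering loop, the condition would become is_relevant(response.choices[0].message.content).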

Key Takeaways

  • Context windows are hard limits: GPT-4 Turbo’s 128K tokens ≈ 96,000 words, but your actual usable space is less after system prompts and conversation history
  • Token counting is essential: Use tiktoken to measure exact usage before sending requests—this prevents silent truncation and controls costs
  • Chunking and compression cut token usage: the structured rewrite in Step 4 shrank a 427-token ticket to 178 tokens (about 58%), and chunking lets you send only the sections a question actually needs
  • Model selection matters: For 100K tokens, Gemini 1.5 Pro costs $0.70 vs GPT-4 Turbo’s $3.00—choose based on your context needs

What’s Next

Now that you understand context windows, learn about vector databases and embeddings to handle documents too large for any context window by retrieving only relevant sections.


Key Takeaway: Context windows determine how much information an AI model can process at once. Knowing the window sizes (from GPT-4 Turbo’s 128K to Claude 3.5 Sonnet’s 200K tokens, and up to 1M for Gemini 1.5 Pro) and applying techniques like chunking and compression can cut costs substantially while improving response quality.


New AI tutorials published daily on AtlasSignal. Follow @AtlasSignalDesk for more.

