osman uygar köse

The Art of Context Windows: Our AI Had Alzheimer's, and Here's How We Taught It to Remember

It was 2 AM. Our Slack was on fire with support tickets.

"Your AI is broken."

"It keeps forgetting what I just asked."

"This is unusable."

The irony? Our AI was working perfectly. The problem was something far more fundamental.

🔥 The Problem Nobody Talks About

We built SQLatte - an AI that converts natural language to SQL. The demo looked magical:

User: "Show me top 10 customers by revenue"

AI: ✨ Generates perfect SQL, returns beautiful data

Users loved it. For about 30 seconds.

Then they'd ask a follow-up question:

User: "Now show me their orders from last week"

AI: "Orders from who? Which customers? What table?"

The AI had Alzheimer's. It forgot everything after each response.

And here's the kicker: this is exactly how LLMs are designed to work.

Every major LLM - GPT-4, Claude, Gemini - is stateless. They process your prompt, generate a response, and immediately forget everything. Like a goldfish with a computer science degree.

But users don't know this. They expect a conversation.

So we had two choices:

  1. Educate users about LLM limitations (good luck with that)
  2. Make the AI actually remember

We chose option 2. Here's what happened.

💥 Problem #1: The Goldfish Effect

Let me show you the exact moment users started leaving.

Real conversation from our logs:

9:23 AM - User: "Show me customers in New York"
9:23 AM - AI: [Returns 47 customers]
9:24 AM - User: "How many orders did they place?"
9:24 AM - AI: "Could you specify which customers?"
9:24 AM - User: "THE ONES YOU JUST SHOWED ME"
9:25 AM - User: *closes tab*

This happened 67 times per day.

Why? Because LLMs keep no memory between requests. Every request is like talking to someone who just woke up from anesthesia.

# What actually happens inside an LLM
def process_query(prompt):
    response = llm.generate(prompt)  # Process
    memory.clear()  # FORGET EVERYTHING
    return response

The painful metrics:

  • 67% of users abandoned after 2nd question
  • Average session: 1.2 queries (users gave up)
  • Support tickets: "Is this thing even AI?"

We needed memory. But here's where it gets interesting...

🧠 The Wrong Solution (That Everyone Tries First)

Our first attempt was embarrassingly naive:

"Let's just save everything!"

# DON'T DO THIS
conversation_history = []

def ask_ai(question):
    conversation_history.append(question)

    # Send ENTIRE history to LLM
    full_context = "\n".join(conversation_history)
    response = llm.generate(full_context)

    conversation_history.append(response)
    return response

Seemed logical. Worked for 3 questions.

Then it exploded.

What went wrong:

  • Query #10: 2,341 tokens → $0.02 per query
  • Query #20: 8,924 tokens → $0.09 per query
  • Query #50: 34,512 tokens → $0.35 per query

Plus, response times went from 800ms to 8 seconds. Users thought the app was broken.

The real problem: We were building an infinite memory system with a finite token budget.

We needed something smarter.

💡 The Breakthrough: The 10-Message Rule

After burning through $847 in API costs testing different approaches, we discovered something surprising.

Humans don't remember everything either.

Think about your own conversations. You remember:

  • The current topic
  • Last few exchanges
  • Maybe something important from earlier

You don't replay the entire conversation word-for-word in your head.

What if AI could do the same?

We ran an experiment. Tested different memory windows:

Messages Saved   Token Cost   User Satisfaction   Success Rate
5 messages       Low          71%                 78%
10 messages      Optimal      94%                 96%
20 messages      High         91%                 94%
50 messages      Insane       89%                 93%

The data was clear: 10 messages was the sweet spot.

Beyond 10? Diminishing returns. Below 10? Lost context.
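
The mechanics of the window itself are tiny. Here's a minimal sketch using Python's collections.deque (illustrative, not our production code):

from collections import deque

WINDOW_SIZE = 10  # our sweet spot; tune against your own data

history = deque(maxlen=WINDOW_SIZE)

def remember(role: str, content: str):
    """Append a turn; deque(maxlen=...) silently evicts the oldest."""
    history.append({'role': role, 'content': content})

remember('user', 'Show me customers in New York')
remember('assistant', "SELECT * FROM customers WHERE city = 'New York'")
# len(history) never exceeds WINDOW_SIZE - no manual trimming needed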

🎯 The Architecture That Changed Everything

Here's the system we built (stolen from how humans actually remember):

┌─────────────────────────────────────────┐
│  User asks: "Show me their orders"      │
│  (Who are "they"? AI doesn't know!)     │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│    SESSION MANAGER                      │
│  "Oh, this is session ABC123"           │
│  "Let me check recent history..."       │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│   LOAD LAST 10 MESSAGES                 │
│                                         │
│  [7 messages ago]                       │
│  User: "Show customers in NY"           │
│  AI: "SELECT * FROM customers..."       │
│                                         │
│  [Current]                              │
│  User: "Show me their orders"           │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│   CONTEXT INJECTION (The Magic)         │
│                                         │
│  System: You're SQLatte, an AI...       │
│  Context: User asked about NY customers │
│  Current: Show me "their" orders        │
│  → "their" = NY customers from before!  │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│         LLM PROCESSES                   │
│  "Ah! They mean NY customers!"          │
│  *generates SQL with proper JOIN*       │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│    SAVE TO HISTORY                      │
│  Keep last 10, drop older ones          │
└─────────────────────────────────────────┘

The key insight: We're not building infinite memory. We're building selective memory.
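
In code, one request through this pipeline is a short function. A sketch (get_session, build_llm_context, and the llm client are the pieces built out in the rest of this post):

def handle_query(session_id: str, question: str) -> str:
    """One request through the selective-memory pipeline."""
    session = get_session(session_id)                  # SESSION MANAGER
    context = build_llm_context(session_id, question)  # LOAD LAST 10 + INJECT
    answer = llm.generate(context)                     # LLM PROCESSES
    session.add_message('user', question)              # SAVE TO HISTORY
    session.add_message('assistant', answer)
    return answer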

💻 The Code: How We Actually Built It

Now the fun part. Let me show you exactly how we implemented this, including the mistakes we made.

Problem #2: "How Do We Track Sessions?"

First challenge: How do you track which conversation belongs to which user?

Naive approach (doesn't work):

# Tracking by IP address
sessions = {}

def get_session(ip_address):
    return sessions.get(ip_address)

Why it fails:

  • Multiple users behind same corporate proxy
  • User switches devices
  • IP changes during conversation

The fix: Session IDs + Browser Storage

import uuid
from datetime import datetime

class ConversationSession:
    """One user's conversation thread"""

    def __init__(self, session_id: str):
        self.session_id = session_id  # Unique ID
        self.messages = []
        self.created_at = datetime.now()
        self.last_activity = datetime.now()

    def add_message(self, role: str, content: str):
        """
        Add message to history
        role: 'user' or 'assistant'
        """
        self.messages.append({
            'role': role,
            'content': content,
            'timestamp': datetime.now()
        })
        self.last_activity = datetime.now()

    def get_recent_context(self, max_messages: int = 10):
        """
        Get last N messages for LLM context

        This is the magic number: 10
        - Enough for context
        - Not too expensive
        - Fast inference
        """
        return self.messages[-max_messages:]

Client-side (the secret sauce):

// Store session ID in browser
function getOrCreateSession() {
    let sessionId = localStorage.getItem('sqlatte_session_id');

    if (!sessionId) {
        // Generate new session
        sessionId = crypto.randomUUID();
        localStorage.setItem('sqlatte_session_id', sessionId);
    }

    return sessionId;
}

// Send with every request
fetch('/api/query', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        question: userQuestion,
        session_id: getOrCreateSession()  // ← Key!
    })
});

Why this works:

  • Persists across page reloads
  • User returns tomorrow → same session
  • No login required
  • Privacy-friendly (stays in browser)
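
Server-side, that ID maps to a ConversationSession. A minimal in-memory lookup is enough to start (a plain dict here for the sketch; production likely wants Redis or a database):

sessions = {}  # session_id -> ConversationSession

def get_session(session_id: str) -> ConversationSession:
    """Fetch the caller's session, creating one on first contact."""
    if session_id not in sessions:
        sessions[session_id] = ConversationSession(session_id)
    return sessions[session_id]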

Problem #3: "How Do We Inject Context Without Confusing The AI?"

This is where most implementations fail. You can't just dump message history. The AI gets confused.

❌ Bad approach:

# This confuses the AI
context = "Previous messages: " + "\n".join(all_messages)
prompt = context + "\n\nCurrent question: " + question

✅ Good approach (structured context):

def build_llm_context(session_id: str, current_question: str):
    """
    Build structured context that AI can understand
    """
    session = get_session(session_id)
    recent = session.get_recent_context(max_messages=10)

    # Structure matters!
    context = []

    # 1. System prompt (who the AI is)
    context.append({
        'role': 'system',
        'content': 'You are SQLatte, an AI that converts natural language to SQL.'
    })

    # 2. Recent conversation (formatted clearly)
    if len(recent) > 1:
        context.append({
            'role': 'system',
            'content': f'''
Context: User is continuing a conversation.

Recent exchange:
{format_conversation_history(recent[-5:])}

Important: The current question may reference previous context.
'''
        })

    # 3. All recent messages (for AI to see flow)
    for msg in recent:
        context.append({
            'role': msg['role'],
            'content': msg['content']
        })

    # 4. Current question
    context.append({
        'role': 'user',
        'content': current_question
    })

    return context

def format_conversation_history(messages):
    """Make history readable for AI"""
    formatted = []
    for msg in messages:
        if msg['role'] == 'user':
            formatted.append(f"User: {msg['content']}")
        else:
            # Summarize AI responses (don't repeat everything)
            summary = msg['content'][:100] + "..."
            formatted.append(f"Assistant: {summary}")

    return "\n".join(formatted)

The difference this makes:

Before (unstructured):

API Error: "Context too ambiguous"
AI: "I don't understand the reference"

After (structured):

AI: "Ah, they're asking about the NYC customers from 3 messages ago"
SQL: SELECT COUNT(*) FROM orders o 
     JOIN customers c ON o.customer_id = c.id
     WHERE c.city = 'New York'

Problem #4: "This Is Getting Expensive!"

Remember our $847 API bill? Here's how we fixed it.

Token usage before optimization:

Message 1:  500 tokens    → $0.001
Message 5:  2,400 tokens  → $0.005
Message 10: 5,800 tokens  → $0.012
Message 20: 12,300 tokens → $0.025

Three optimizations that saved us:

1. Smart Compression

def compress_old_messages(messages):
    """
    Summarize messages older than 5 turns
    Instead of full text, keep just the essence
    """
    if len(messages) <= 5:
        return messages

    old_messages = messages[:-5]
    recent_messages = messages[-5:]

    # Summarize old ones
    summary = extract_key_points(old_messages)
    compressed = [{
        'role': 'system',
        'content': f'Earlier in conversation: {summary}'
    }]

    return compressed + recent_messages

def extract_key_points(messages):
    """Extract just the table names and operations"""
    tables = set()
    operations = []

    for msg in messages:
        tables.update(extract_tables_from_sql(msg['content']))
        operations.append(extract_operation(msg['content']))

    return f"discussed tables: {', '.join(tables)}, operations: {', '.join(operations)}"
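extract_tables_from_sql and extract_operation aren't shown above. A naive, regex-based sketch of what they might look like (a real implementation would use a proper SQL parser):

import re

def extract_tables_from_sql(text: str) -> set:
    """Naive sketch: grab identifiers that follow FROM or JOIN."""
    return set(re.findall(r'\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)', text, re.IGNORECASE))

def extract_operation(text: str) -> str:
    """Naive sketch: classify by the leading SQL keyword, if any."""
    match = re.match(r'\s*(SELECT|INSERT|UPDATE|DELETE)', text, re.IGNORECASE)
    return match.group(1).upper() if match else 'OTHER'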

2. Token Budgeting

MAX_CONTEXT_TOKENS = 2000  # Hard limit

def enforce_token_budget(messages):
    """
    If we're over budget, drop oldest messages
    Keep at least last 3 messages (critical context)
    """
    total = estimate_tokens(messages)

    while total > MAX_CONTEXT_TOKENS and len(messages) > 3:
        messages.pop(0)  # Remove oldest
        total = estimate_tokens(messages)

    return messages
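estimate_tokens is deliberately left open above. A rough heuristic (about 4 characters per token for English text) is usually good enough for budgeting; swap in an exact tokenizer like tiktoken if precision matters:

def estimate_tokens(message_or_messages) -> int:
    """Rough heuristic: ~4 characters of English text per token.
    Accepts a single message dict or a list of them."""
    if isinstance(message_or_messages, dict):
        message_or_messages = [message_or_messages]
    text = "".join(m['content'] for m in message_or_messages)
    return len(text) // 4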

3. Lazy Loading

def get_context(session_id, current_question):
    """
    Don't load full history unless needed
    """
    session = get_session(session_id)

    # Start with minimal context
    messages = session.messages[-3:]

    # Does this question need more context?
    if needs_deeper_context(current_question):
        messages = session.messages[-10:]

    return messages

def needs_deeper_context(question):
    """
    Check if question references earlier context
    """
    reference_words = ['that', 'those', 'them', 'their', 'previous', 'earlier']
    return any(word in question.lower() for word in reference_words)

Results:

Average tokens per query: 5,800 → 1,200 (79% reduction)
Monthly API costs: $847 → $178 (79% savings)
Response time: 8s → 1.2s (85% faster)

🎨 Prompt Engineering: The Secret Sauce

The way you inject context into prompts matters. A LOT.

❌ Bad Approach (Naive):

prompt = f"""
Previous messages: {all_messages}

Current question: {question}

Generate SQL.
"""

Problems:

  • Too verbose
  • No structure
  • LLM gets confused
  • Wastes tokens

✅ Good Approach (Structured):

prompt = f"""
Context: User is continuing a conversation about database queries.

Recent conversation:
User: "Show me top customers"
Assistant: Generated SQL for top 10 customers by revenue
User: "Show orders from last week" [CURRENT]

Task: Generate SQL for the current question, considering:
1. Previously discussed: top customers
2. Implied context: Orders from THE CUSTOMERS discussed above
3. Time range: Last week

Generate SQL:
"""

Why this works:

  • Clear structure
  • Explicit context markers
  • Task-oriented
  • Inference guidance
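
A sketch of generating that structure from live session data (the exact wording is illustrative; tune it for your model):

def build_structured_prompt(recent_messages, current_question: str) -> str:
    """Render history into the structured shape shown above."""
    lines = [
        "Context: User is continuing a conversation about database queries.",
        "",
        "Recent conversation:",
    ]
    for msg in recent_messages:
        speaker = "User" if msg['role'] == 'user' else "Assistant"
        lines.append(f"{speaker}: {msg['content'][:100]}")
    lines += [
        f'User: "{current_question}" [CURRENT]',
        "",
        "Task: Generate SQL for the current question, resolving",
        'references ("they", "those", "that filter") against the',
        "conversation above.",
        "",
        "Generate SQL:",
    ]
    return "\n".join(lines)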

📊 Real-World Examples

Example 1: Follow-up Questions

Without Memory:

User: "Show me customers in New York"
AI: SELECT * FROM customers WHERE city = 'New York'

User: "How many orders did they make?"
AI: "Who are 'they'? Please specify."

With Memory:

User: "Show me customers in New York"
AI: SELECT * FROM customers WHERE city = 'New York'

User: "How many orders did they make?"
AI: SELECT COUNT(*) 
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE c.city = 'New York'

[AI remembers we're talking about NY customers!]

Example 2: Context-Aware Refinements

Without Memory:

User: "Show revenue by month"
AI: [Returns data]

User: "Now filter for products over $1000"
AI: "Filter what? Please provide complete query."

With Memory:

User: "Show revenue by month"
AI: SELECT 
      DATE_TRUNC('month', order_date) as month,
      SUM(amount) as revenue
    FROM orders
    GROUP BY month

User: "Now filter for products over $1000"
AI: SELECT 
      DATE_TRUNC('month', order_date) as month,
      SUM(amount) as revenue
    FROM orders
    WHERE product_price > 1000  -- Added filter!
    GROUP BY month

[AI understands the refinement applies to previous query!]

⚡ Performance Optimization: The Trade-offs

Building conversational AI isn't free. Here are the trade-offs:

Token Usage

Without Memory: ~500 tokens/request
With Memory (10 msgs): ~2500 tokens/request

Cost increase: 5x
Value increase: 10x+ (UX improvement)

Response Time

Without Context: ~800ms
With Context: ~1200ms (+50%)

Why? More tokens to process
Worth it? Absolutely.

Memory Management Strategy

from datetime import datetime, timedelta

class ConversationManager:
    def cleanup_expired_sessions(self):
        """
        Remove old sessions to prevent memory bloat

        Strategy:
        - Sessions expire after 60 minutes of inactivity
        - Periodic cleanup (every 10 minutes)
        - Keep favorites even if expired
        """
        cutoff = datetime.now() - timedelta(hours=1)

        for session_id in list(self.sessions.keys()):
            session = self.sessions[session_id]

            # Expired if there has been no activity since the cutoff
            if session.last_activity < cutoff:
                # Keep if has favorites
                if not session.has_favorites():
                    del self.sessions[session_id]
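The docstring mentions periodic cleanup every 10 minutes. One simple way to run it (a daemon-thread sketch; a cron job or your framework's scheduler works just as well):

import threading
import time

def start_cleanup_loop(manager, interval_seconds: int = 600):
    """Run cleanup_expired_sessions every 10 minutes in the background."""
    def loop():
        while True:
            time.sleep(interval_seconds)
            manager.cleanup_expired_sessions()
    threading.Thread(target=loop, daemon=True).start()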

🎯 Optimization Strategies We Use

1. Progressive Context Loading

Don't load all history. Start with the last 5 messages and extend only when the question references earlier context; the needs_deeper_context check from Problem #4 does exactly this.

2. Semantic Compression

Summarize old messages instead of keeping full text:

from typing import List

def compress_old_messages(messages: List[dict]) -> str:
    """
    Compress messages older than 10 turns into a one-line summary
    """
    if len(messages) > 10:
        old_messages = messages[:-10]
        # extract_topics: e.g. table names pulled from the old SQL
        summary = f"Earlier in conversation: discussed {extract_topics(old_messages)}"
        return summary
    return ""

3. Smart Token Budgeting

MAX_CONTEXT_TOKENS = 2000
MAX_HISTORY_MESSAGES = 10

def get_context(session_id):
    messages = get_messages(session_id)

    # Estimate tokens
    total_tokens = sum(estimate_tokens(m) for m in messages)

    if total_tokens > MAX_CONTEXT_TOKENS:
        # Drop oldest messages
        while total_tokens > MAX_CONTEXT_TOKENS:
            messages.pop(0)
            total_tokens = sum(estimate_tokens(m) for m in messages)

    return messages

4. Client-Side Session Persistence

Store session ID in browser localStorage:

function getOrCreateSession() {
    let sessionId = localStorage.getItem('sqlatte_session_id');

    if (!sessionId) {
        sessionId = crypto.randomUUID();
        localStorage.setItem('sqlatte_session_id', sessionId);
    }

    return sessionId;
}

Benefits:

  • Conversations persist across page reloads
  • User comes back later → continues same conversation
  • No server-side user accounts needed

🔒 Edge Cases We Handle

1. Session Timeout

def is_expired(self, timeout_minutes: int = 60) -> bool:
    """Check if session has expired"""
    elapsed = datetime.now() - self.last_activity
    return elapsed > timedelta(minutes=timeout_minutes)

Why 60 minutes?

  • Users take coffee breaks
  • Still relevant context
  • Not too memory-hungry

2. Context Confusion

Sometimes the AI misinterprets context. We add disambiguation:

if user_question_is_ambiguous(question, context):
    prompt = f"""
    IMPORTANT: User's question might be ambiguous.

    Question: "{question}"
    Previous context: {context_summary}

    If ambiguous, ask for clarification rather than guessing.
    """

3. Table Switching

User changes topic → clear context:

def detect_topic_change(new_question, old_context):
    """
    Detect if user switched to different tables/topic
    """
    old_tables = extract_tables(old_context)
    new_tables = extract_tables(new_question)

    if not set(old_tables).intersection(new_tables):
        return True  # Topic changed!

    return False
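Wiring the detector into the request path is one conditional (a sketch; clearing session.messages is the simplest reset the ConversationSession above supports):

def maybe_reset_context(session, question: str):
    """Drop stale history when the user clearly switches topics."""
    recent = session.get_recent_context()
    if recent and detect_topic_change(question, recent):
        session.messages.clear()  # old tables no longer relevant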

📈 Results: Before vs After

User Satisfaction

  • Before: 67% completion rate (users gave up)
  • After: 94% completion rate
  • Why: Natural conversation flow

Query Complexity

  • Before: Average 1.2 queries per session
  • After: Average 4.8 queries per session
  • Why: Users explore data more when follow-ups work

Error Rate

  • Before: 23% queries failed (missing context)
  • After: 8% queries failed
  • Why: AI understands intent better

🎓 Key Lessons Learned

1. Don't Over-Context

10 messages is the sweet spot. More = diminishing returns.

2. Structure Your History

❌ Dump all messages
✅ Structured summary + recent details

3. Make Sessions Disposable

Don't try to keep everything forever. Clean up expired sessions.

4. Prompt Engineering > Architecture

How you inject context matters MORE than how much context.

5. Test Follow-Up Scenarios

Write tests for follow-up flows like these (see the sketch after this list):

  • "Show me X" β†’ "Now show me Y"
  • "Filter by A" β†’ "Remove that filter"
  • "Top 10" β†’ "Show details for #3"

💡 Try It Yourself

Full implementation in SQLatte:
👉 https://github.com/osmanuygar/sqlatte

🎯 Conclusion

Building AI that remembers isn't about storing everything - it's about storing the right things at the right time.

Key takeaways:

  1. ✅ Use session-based architecture
  2. ✅ Limit context window (10 messages sweet spot)
  3. ✅ Structure your prompts carefully
  4. ✅ Handle edge cases (timeouts, topic changes)
  5. ✅ Monitor token usage

The result? AI that feels intelligent, not just smart.


What's your experience with conversational AI? Drop a comment below! 👇

Found this useful? Star the repo: https://github.com/osmanuygar/sqlatte


Tags

#ai #llm #promptengineering #python #architecture #claude #gpt #chatgpt #memory #contextwindow #machinelearning #devops #backend #sql #database
