Tutorials by Sheary Tan

The SQL Queries Every LLM Engineer Should Know (And Why They Matter More Than You Think)

Let’s be honest—when you first heard about building LLM projects, SQL probably wasn’t the first thing that came to mind. You were probably thinking about transformers, attention mechanisms, maybe even dreaming about the next GPT breakthrough. But here’s the thing: behind every successful LLM deployment, there’s a mountain of data that needs organizing, cleaning, and querying. And guess what handles most of that heavy lifting? Good old SQL.

After working on dozens of LLM projects (and making plenty of mistakes along the way), I’ve noticed certain SQL patterns keep showing up. These aren’t your typical “SELECT * FROM users” queries—these are the workhorses that actually make LLM projects tick.

Why SQL Still Rules the LLM World

You know what’s funny? We’re living in an age of cutting-edge AI, yet we’re still relying on a language from the 1970s to manage our data. But there’s a good reason for this: SQL just works. When you’re dealing with massive datasets for training, fine-tuning, or even just storing conversation logs, SQL databases offer the reliability and performance you need.

Plus, let’s face it—most organizations already have their data in SQL databases. You’re not going to convince your CTO to migrate everything to a vector database just because it sounds cooler.

The Heavy Hitters: Core Queries You’ll Use Daily

1. The Training Data Aggregator

WITH clean_data AS (
    SELECT 
        id,
        prompt,
        response,
        quality_score,
        token_count,
        created_at,
        ROW_NUMBER() OVER (PARTITION BY prompt ORDER BY quality_score DESC) as rn
    FROM training_conversations 
    WHERE 
        LENGTH(prompt) > 10 
        AND LENGTH(response) > 20
        AND quality_score >= 0.7
        AND token_count <= 4096
),
balanced_sample AS (
    SELECT *
    FROM clean_data
    WHERE rn = 1  -- Remove duplicate prompts, keeping highest quality
)
SELECT 
    prompt,
    response,
    quality_score
FROM balanced_sample
ORDER BY RANDOM()
LIMIT 50000;

This query does something crucial: it cleans and deduplicates your training data, then pulls a random sample for training. The ROW_NUMBER() window function handles duplicate prompts (because trust me, you’ll have them), while the filtering ensures you’re not training on garbage data. I learned this the hard way after spending three days training a model on data that included a bunch of “test123” prompts.
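
Once the sample looks right, it usually needs to leave the database and feed your fine-tuning pipeline. Here’s a minimal export sketch, assuming PostgreSQL and psql; training_sample is a hypothetical staging table holding the output of the query above, not part of the schema shown here.

-- Hypothetical staging table: materialize the query above so downstream
-- jobs (tokenization, fine-tuning) see a stable snapshot
CREATE TABLE training_sample AS
SELECT prompt, response, quality_score
FROM training_conversations
WHERE quality_score >= 0.7;  -- substitute the full query above in practice

-- Export to CSV from psql; \copy runs client-side, so no server file permissions needed
\copy training_sample TO 'training_sample.csv' WITH (FORMAT csv, HEADER)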


2. The Token Budget Calculator

SELECT 
    conversation_id,
    SUM(token_count) as total_tokens,
    COUNT(*) as message_count,
    AVG(token_count) as avg_tokens_per_message,
    CASE 
        WHEN SUM(token_count) > 16384 THEN 'truncate_needed'
        WHEN SUM(token_count) > 8192 THEN 'long_context'
        ELSE 'standard'
    END as context_category
FROM chat_messages 
WHERE conversation_id IN (
    SELECT conversation_id 
    FROM active_conversations 
    WHERE last_updated > CURRENT_DATE - INTERVAL '7 days'
)
GROUP BY conversation_id
HAVING SUM(token_count) > 100
ORDER BY total_tokens DESC;

Context window management is probably one of the trickiest parts of working with LLMs. This query helps you understand your token distribution and identify conversations that might need chunking or truncation. The categorization makes it easy to handle different conversation lengths programmatically.
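
If you want to act on those categories in SQL rather than in application code, a running token total makes it easy to find the messages that spill over the budget. A rough sketch, assuming chat_messages also has id and created_at columns (an assumption; they aren’t shown above):

WITH running AS (
    SELECT 
        id,
        conversation_id,
        token_count,
        -- Cumulative tokens counted from the newest message backwards
        SUM(token_count) OVER (
            PARTITION BY conversation_id 
            ORDER BY created_at DESC
            ROWS UNBOUNDED PRECEDING
        ) as tokens_from_end
    FROM chat_messages
)
SELECT id, conversation_id
FROM running
WHERE tokens_from_end > 8192;  -- candidates to truncate, summarize, or archive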

3. The Performance Monitor

WITH response_metrics AS (
    SELECT 
        DATE(created_at) as query_date,
        model_version,
        AVG(response_time_ms) as avg_response_time,
        AVG(token_count) as avg_tokens,
        COUNT(*) as total_requests,
        SUM(CASE WHEN response_time_ms > 5000 THEN 1 ELSE 0 END) as slow_responses,
        AVG(user_rating) as avg_rating
    FROM llm_requests 
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY DATE(created_at), model_version
)
SELECT 
    query_date,
    model_version,
    avg_response_time,
    avg_tokens,
    total_requests,
    ROUND(slow_responses * 100.0 / total_requests, 2) as slow_response_percentage,
    avg_rating,
    LAG(avg_response_time) OVER (PARTITION BY model_version ORDER BY query_date) as prev_response_time
FROM response_metrics
ORDER BY query_date DESC, model_version;

Monitoring isn’t glamorous, but it’s essential. This query gives you a daily breakdown of how your models are performing. The LAG() function is particularly handy—it lets you spot performance trends before they become problems.
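
To turn that trend data into an actual alert, a small follow-up query can flag regressions. A sketch, assuming you’ve saved the query above as a view (response_metrics_daily is a hypothetical name):

-- Flag days where latency degraded sharply versus the previous day
SELECT 
    query_date,
    model_version,
    avg_response_time,
    prev_response_time
FROM response_metrics_daily
WHERE prev_response_time IS NOT NULL
  AND avg_response_time > prev_response_time * 1.5;  -- 50% slower than the day before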

The Data Wranglers: Preprocessing Queries

4. The Conversation Threader

WITH RECURSIVE conversation_thread AS (
    -- Base case: find conversation starters
    SELECT 
        id,
        parent_id,
        user_id,
        content,
        created_at,
        1 as depth,
        id as root_id  -- remember which starter each message descends from
    FROM messages 
    WHERE parent_id IS NULL
    
    UNION ALL
    
    -- Recursive case: find replies
    SELECT 
        m.id,
        m.parent_id,
        m.user_id,
        m.content,
        m.created_at,
        ct.depth + 1,
        ct.root_id
    FROM messages m
    INNER JOIN conversation_thread ct ON m.parent_id = ct.id
    WHERE ct.depth < 20  -- Prevent infinite recursion
)
SELECT 
    root_id,
    -- Group by the thread's root so the whole exchange collapses into one row
    STRING_AGG(content, E'\n---\n' ORDER BY depth, created_at) as full_conversation
FROM conversation_thread
GROUP BY root_id
HAVING COUNT(*) >= 3;  -- Only conversations with multiple exchanges

Recursive CTEs aren’t something you use every day, but when you need to reconstruct conversation threads from a flat message table, they’re perfect. This is especially useful when preparing data for fine-tuning conversational models.
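
If your fine-tuning pipeline expects chat-formatted JSON instead of a delimited string, the same recursive CTE can feed a JSON aggregate. A sketch you can swap in for the final SELECT of the query above, using PostgreSQL’s jsonb functions; the role column is an assumption (the messages table above only shows user_id):

-- Swap the final SELECT of the query above for a JSON version
SELECT 
    root_id,
    jsonb_agg(
        jsonb_build_object('role', role, 'content', content)
        ORDER BY depth, created_at
    ) as messages_json
FROM conversation_thread
GROUP BY root_id
HAVING COUNT(*) >= 3;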

5. The Semantic Similarity Finder

WITH pairwise AS (
    SELECT 
        a.id as query_id,
        a.content as query_content,
        b.id as similar_id,
        b.content as similar_content,
        -- Using a simple Jaccard similarity approximation over whitespace tokens
        (
            CARDINALITY(
                ARRAY(
                    SELECT UNNEST(STRING_TO_ARRAY(LOWER(a.content), ' '))
                    INTERSECT 
                    SELECT UNNEST(STRING_TO_ARRAY(LOWER(b.content), ' '))
                )
            ) * 1.0 / 
            CARDINALITY(
                ARRAY(
                    SELECT UNNEST(STRING_TO_ARRAY(LOWER(a.content), ' '))
                    UNION 
                    SELECT UNNEST(STRING_TO_ARRAY(LOWER(b.content), ' '))
                )
            )
        ) as similarity_score
    FROM user_queries a
    CROSS JOIN user_queries b  -- O(n^2): fine for thousands of queries, not millions
    WHERE 
        a.id < b.id  -- Avoid duplicate pairs
        AND LENGTH(a.content) > 50
        AND LENGTH(b.content) > 50
)
SELECT *
FROM pairwise
WHERE similarity_score > 0.3  -- Filter in an outer query; the alias isn't visible to WHERE/HAVING inside
ORDER BY similarity_score DESC
LIMIT 1000;

Before you invest in fancy vector databases, sometimes a simple text similarity check in SQL can help you identify duplicate or near-duplicate queries. It’s not as sophisticated as embeddings, but it’s fast and gets the job done for basic deduplication.
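
If you’re on PostgreSQL, the pg_trgm extension is a faster, better-indexed alternative to word-level Jaccard for the same job. A sketch, assuming you’re allowed to install extensions on your database:

-- Trigram similarity via pg_trgm instead of word-level Jaccard
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Optional: a trigram index helps when matching new queries against this table
CREATE INDEX IF NOT EXISTS idx_user_queries_trgm 
    ON user_queries USING gin (content gin_trgm_ops);

SELECT 
    a.id as query_id,
    b.id as similar_id,
    similarity(a.content, b.content) as trigram_similarity
FROM user_queries a
JOIN user_queries b ON a.id < b.id
WHERE a.content % b.content  -- true when similarity exceeds pg_trgm's threshold (default 0.3)
ORDER BY trigram_similarity DESC
LIMIT 1000;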

The Production Workhorses

6. The Smart Cache Query

WITH cache_candidates AS (
    SELECT 
        prompt_hash,
        prompt,
        response,
        model_version,
        COUNT(*) as request_count,
        AVG(response_time_ms) as avg_response_time,
        MAX(created_at) as last_requested
    FROM llm_requests 
    WHERE 
        created_at >= CURRENT_DATE - INTERVAL '30 days'
        AND response_time_ms < 10000  -- Only cache fast responses
        AND user_rating >= 4.0  -- Only cache well-rated responses
    GROUP BY prompt_hash, prompt, response, model_version
    HAVING COUNT(*) >= 5  -- Must be requested multiple times
)
INSERT INTO response_cache (prompt_hash, prompt, response, model_version, cache_score)
SELECT 
    prompt_hash,
    prompt,
    response,
    model_version,
    request_count * (6.0 - avg_response_time/1000.0) as cache_score
FROM cache_candidates
-- Requires a unique index on (prompt_hash, model_version)
ON CONFLICT (prompt_hash, model_version) 
DO UPDATE SET 
    cache_score = EXCLUDED.cache_score,
    updated_at = CURRENT_TIMESTAMP
WHERE response_cache.cache_score < EXCLUDED.cache_score;

Caching responses can dramatically improve your LLM application’s performance. This query identifies the best candidates for caching based on frequency, quality, and performance metrics. The cache_score calculation prioritizes frequently requested, fast-responding queries.
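
The read side is much simpler: hash the incoming prompt in your application, look it up, and fall back to the model on a miss. A sketch of the lookup as a parameterized statement, where $1 is the prompt hash and $2 is the model version:

-- Cache lookup for an incoming request
SELECT response
FROM response_cache
WHERE prompt_hash = $1
  AND model_version = $2
LIMIT 1;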

7. The Usage Analytics Powerhouse

WITH daily_usage AS (
    SELECT 
        DATE(created_at) as usage_date,
        user_id,
        COUNT(*) as daily_requests,
        SUM(token_count) as daily_tokens,
        AVG(user_rating) as daily_avg_rating
    FROM llm_requests 
    WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY DATE(created_at), user_id
),
user_segments AS (
    SELECT 
        user_id,
        AVG(daily_requests) as avg_daily_requests,
        AVG(daily_tokens) as avg_daily_tokens,
        AVG(daily_avg_rating) as overall_rating,
        COUNT(*) as active_days,
        CASE 
            WHEN AVG(daily_requests) > 100 THEN 'power_user'
            WHEN AVG(daily_requests) > 20 THEN 'regular_user'
            WHEN AVG(daily_requests) > 5 THEN 'occasional_user'
            ELSE 'light_user'
        END as user_segment
    FROM daily_usage
    GROUP BY user_id
)
SELECT 
    user_segment,
    COUNT(*) as user_count,
    AVG(avg_daily_requests) as segment_avg_requests,
    AVG(avg_daily_tokens) as segment_avg_tokens,
    AVG(overall_rating) as segment_avg_rating,
    AVG(active_days) as segment_avg_active_days
FROM user_segments
GROUP BY user_segment
ORDER BY segment_avg_requests DESC;

Understanding your users is crucial for optimizing LLM performance and costs. This query segments users based on their usage patterns, helping you identify who’s driving your compute costs and how satisfied different user groups are.
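
To turn those segments into a rough spend number, multiply tokens by your per-token price. A sketch you can swap in for the final SELECT of the query above; the $0.002 per 1K tokens is a placeholder, so substitute your provider’s real pricing:

-- Rough cost per segment over the 90-day window (0.002 per 1K tokens is a placeholder)
SELECT 
    user_segment,
    COUNT(*) as user_count,
    SUM(avg_daily_tokens * active_days) as est_total_tokens,
    ROUND(SUM(avg_daily_tokens * active_days) / 1000.0 * 0.002, 2) as est_cost_usd
FROM user_segments
GROUP BY user_segment
ORDER BY est_cost_usd DESC;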


The Troubleshooters

8. The Error Pattern Detective

WITH error_analysis AS (
    SELECT 
        error_type,
        error_message,
        model_version,
        DATE(created_at) as error_date,
        COUNT(*) as error_count,
        STRING_AGG(
            DISTINCT SUBSTRING(prompt, 1, 100), 
            '; ' 
            ORDER BY SUBSTRING(prompt, 1, 100)
        ) as sample_prompts
    FROM llm_errors 
    WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY error_type, error_message, model_version, DATE(created_at)
),
error_trends AS (
    SELECT 
        *,
        LAG(error_count) OVER (
            PARTITION BY error_type, model_version 
            ORDER BY error_date
        ) as prev_day_count
    FROM error_analysis
)
SELECT 
    error_type,
    error_message,
    model_version,
    error_date,
    error_count,
    COALESCE(error_count - prev_day_count, 0) as day_over_day_change,
    sample_prompts
FROM error_trends
WHERE error_count > 10
ORDER BY error_count DESC, error_date DESC;

When things go wrong (and they will), this query helps you quickly identify patterns in your errors. Are certain types of prompts causing issues? Is a specific model version problematic? The day-over-day comparison helps you spot emerging problems.

Making It All Work Together

Here’s something I wish someone had told me earlier: these queries aren’t meant to run in isolation. The real magic happens when you string them together into a data pipeline. You might use the training data aggregator to prep your dataset, then the performance monitor to track how well your fine-tuned model performs, and finally the cache query to speed up common requests.
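
One lightweight way to wire them together, without reaching for extra infrastructure, is to materialize the intermediate results and refresh them on a schedule from cron or your orchestrator. A sketch assuming PostgreSQL materialized views; daily_model_metrics is a hypothetical name:

-- Materialize a monitoring snapshot so dashboards and alerts read a cheap, stable table
CREATE MATERIALIZED VIEW daily_model_metrics AS
SELECT 
    DATE(created_at) as query_date,
    model_version,
    AVG(response_time_ms) as avg_response_time,
    COUNT(*) as total_requests
FROM llm_requests
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(created_at), model_version;

-- Refresh on whatever cadence your pipeline needs (e.g. nightly)
REFRESH MATERIALIZED VIEW daily_model_metrics;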

The key is building these queries into your workflow gradually. Don’t try to implement everything at once—start with the basics and add complexity as your project grows.

A Quick Reality Check

Let me be straight with you: SQL isn’t going to solve all your LLM problems. You’ll still need proper MLOps tools, monitoring systems, and probably a vector database for semantic search. But SQL will handle the grunt work of data management, and it’ll do it reliably.

These queries have saved me countless hours of debugging and optimization. They’re not perfect—you’ll need to adapt them to your specific schema and requirements—but they’re a solid starting point.

The next time you’re knee-deep in LLM development and wondering why your training data looks weird or your responses are slow, remember these queries. They might just be the debugging tools you need.

Final Thoughts

As LLM projects evolve, so do the SQL patterns we use. I’m seeing more integration with time-series data for monitoring, better approaches to handling multi-modal data, and smarter caching strategies. The fundamentals remain the same though: clean data in, good models out.

Keep these queries handy, adapt them to your needs, and don’t forget—sometimes the old tools are the best tools for the job.


Stay tuned for the next tutorial!
