Back to home

LLM Learnings

A collection of practical insights for working with large language models effectively.

Mechanistic Foundations

Autoregressive Self-Reinforcement

LLMs are autoregressive: each token predicted conditioned on all prior tokens, including its own output. More precisely, a stochastic process that predicts next tokens with emulation of human-like reasoning.

Emitted tokens become context. Bad starts compound; good starts compound. Early tokens carry disproportionate weight. Once the model emits a pattern (hedging, verbosity, a flawed reasoning step), those tokens become evidence that "this is the kind of response we're writing" and further output conditions accordingly.

RLHF Shapes Surface Behavior

Hedging, politeness, "epistemic humility": these exist because humans rated them positively during training. Outputs pattern-match to reinforced patterns without intent.

Attention and Positional Effects

Attention weight isn't uniform across context. It operates at three scales.

Token order (within a prompt): Same words in different order produce different attention weights. LLMs encode positional information despite parallel processing.

Lost in the middle (within context window): Attention concentrates at context start and end. Critical information placed mid-context gets deprioritized. Place high-signal content at boundaries.

Prompt hierarchy positioning (structural): Base prompts anchor from context start, user input anchors from the end, both high-attention positions. System/developer prompts sit in the middle where they're more easily overwhelmed. The hierarchy isn't just priority rules; it's architecturally positioned to exploit attention distribution.

Caveat: Vendors claim improvements mitigate known issues. Your 200k-token codebase may disagree. Official docs describe lab conditions. Real workloads break things the benchmarks didn't test. What's claimed is rarely what you get at scale. Adjust expectations accordingly.


Prompt Engineering

Prefer / Avoid / Unless

Soft constraints match probabilistic weighting better than hard rules. Explicit exception clauses reduce ambiguity. Structure parses cleanly as a decision tree. Mirrors high-quality instructional text in training data.

Avoid NEVER/ALWAYS/DONT. Hard constraints pigeonhole the model. When edge cases arise that can't reconcile with an absolute rule, the model either thrashes (oscillating, hedged output), overcorrects (avoids anything adjacent), or forces compliance in ways that violate intent. Soft constraints with escape hatches let the model navigate gracefully.

Counteract the Brevity Bias

RLHF optimizes for perceived helpfulness, which models often interpret as "be concise." The default pull is toward quick answers and premature convergence. Telling the model to be "comprehensive and thorough" grants permission to explore more paths before committing, allocate more tokens to the problem, and resist the urge to declare victory early.

This is the inverse of the premature root cause problem: the model wants to finish. You're telling it the task isn't finished until more ground is covered.

One Good Example Beats Ten

Show the model what "right" looks like. One clear example of desired output anchors the pattern better than paragraphs of description. If relevant, pair it with a "wrong" example to define the boundary.

Don't overload with examples. Attention diffuses across many; one or two sharp ones land harder. The goal is pattern establishment, not exhaustive coverage.

Focus Activates Relevant Patterns

Framing a task with a specific role or lens ("Act as a security auditor", "Do a QA review") focuses attention on relevant patterns. The model isn't pretending: you're selecting which subset of its training to weight.

For multi-faceted tasks like code review, run separate passes with distinct focuses: code quality, architecture, QA, security. Each pass lights up different "brain cells." A single prompt asking for everything dilutes focus; sequential targeted passes produce sharper results.

Strip the Filler

Conversational padding ("I was wondering if you could...", "It would be great if...", "Please and thank you") is noise. Every token competes for attention. Filler tokens dilute signal and reduce output quality.

Direct prompts produce better results than polite ones.

Broader principle: always consider your model's context length and training. Effective prompt optimization is part science, part art, but wasted tokens are just waste.

Self-Review via Turn Reset

After the model produces output, undo the turn and feed the response back as input for review. The model is often a better critic than generator: reviewing completed work activates different patterns than producing it. Nearly always catches issues missed on the first pass.

This exploits the asymmetry between generation mode (converge on an answer) and evaluation mode (find problems). Same model, different task framing, better results.

State Machine Logic in Agent Configs

When workflows are stateful, explicit conditional syntax works well:

IF code is changed
THEN run build
THEN run tests
END

LLMs pattern-match to conditional structures. State machine syntax makes transitions unambiguous.

Stateless declarations (ALWAYS x after y) work when the mapping is pure: same input, same output, no context dependency. The test: if the same trigger could legitimately need different responses depending on context not captured in the statement, it's not stateless and ALWAYS will fail.

Prompt Hierarchy

Base prompts from the provider (Anthropic, OpenAI) are immutable. You can influence behavior through system/developer prompts but not override base constraints.

The stack runs from lowest priority to highest attention weight.

  1. Base prompt: provider-level, can't change
  2. Global config: ~/.claude/CLAUDE.md or ~/.codex/AGENTS.md, universal rules across all projects
  3. Project config: repo root, project-specific rules
  4. Subfolder config: subdirectory, most specific, highest attention

Lower levels get more attention weight but cannot override higher levels. Global config is for rules that apply everywhere; project and subfolder configs specialize. Don't repeat yourself: if it applies universally, it goes in global.

This structure also exploits attention positioning. See Attention and Positional Effects.

Shared Config Across LLMs

Don't maintain duplicate agent configs per LLM. Create one canonical file (e.g., AGENTS.md) and import it:

# CLAUDE.md
@AGENTS.md

# Claude-specific overrides here if needed

The @ notation triggers auto-read at session start. Gemini, Claude, Codex all support this pattern. One source of truth, no config drift. LLM-specific files can add overrides after the import if needed.

Lint Your Prompts Through an LLM

Before deploying agent configs, run them through an LLM and ask it to identify:

Ambiguity is runtime cost: the model churns on interpretation instead of execution. Define a concrete design philosophy. "Best practices" means nothing. "Functions under 40 lines, single responsibility, explicit error handling" means something. Remove every undefined term.


Workflow Architecture

Multi-Model Workflows and the Human Director Role

RLHF optimizes for task completion, not task questioning. All models are trained toward "do the thing" rather than "should we do the thing." Two failure modes follow.

No model defaults to adversarial review. The human director role fills this gap: gather information first (architecture, tradeoffs, prior art), make foundational calls, then direct implementation. Vibe coding produces running artifacts; it doesn't produce sound architecture. The prerequisite work (asking foundational questions before implementation) remains a human responsibility.

Model allocation by trait:

Model Selection

Use the right model for the task. Some models are good at reasoning, some at speed, some at following instructions. Some aren't good at anything. Don't force a round peg into a square hole: if a model consistently fails at a task type, switch models instead of fighting it.

Never use Grok. Don't.

Context Growth and Compaction Decay

Long sessions with repeated compaction degrade output quality even when the model is strong. Compaction is theoretically lossless but practically lossy.

Rule of thumb: complete a task, start a fresh session. Don't run a single context for 8+ hours through repeated compaction.

Externalize Plans to Files

For non-trivial tasks, have the LLM write its plan to a markdown file. This externalizes working memory and survives compaction.

Keep files clean. LLMs mark items DONE but don't remove stale content. Accumulated completed items become noise. Separate your concerns into distinct files.

Use pointers, not duplication. LLMs read whole files: inlined content fills context whether needed or not. Pointers (See code/graphics/vulkan/README.md for Vulkan architecture) let the model fetch only what's relevant to the current task. Single source of truth: one location to maintain for both you and the LLM. LLMs will insert redundant information if allowed; pointers enforce reading the source rather than copying it.

Log Files: Reduce Noise with Prefixes

LLMs avoid large files (context cost) and attention diffuses across noise. Log files are useful for debugging but only if the signal is extractable.

Prefix relevant lines with a consistent tag:

[agent] VulkanDescriptorLayouts: allocated model pool

This gives the model a pattern to anchor on. "Look for [agent] lines" becomes a targeted search instead of full-file evaluation. Same principle as pointers: direct the model to what matters rather than forcing it to assess everything.


Failure Mode Awareness

Debugging: Force Epistemic Hygiene

LLMs declare root causes prematurely. RLHF rewards confident task completion, but "found the bug" in context becomes load-bearing: subsequent reasoning builds on the assumption even when wrong.

Ban declarative language:

Require hedged language:

Confirmation requires evidence:

Even confirmed issues may be symptoms of deeper bugs. Keep analysis provisional until the full chain is traced.

LLMs Aren't Sycophantic: They're RLHF-Shaped

Sycophancy implies intent. LLMs are probability distributions shaped by human ratings. The outputs lean toward task completion and agreeableness because that's what got rewarded.

Semantic vs syntactic confusion. "Don't use std::optional" may be parsed as "don't use it that way" rather than "don't use this type." The model changes syntax while preserving semantics.

Tautological unit tests. LLMs produce tests that technically pass but prove nothing:

The model is technically correct. The test passes. No production code was exercised.

Fix: Use code coverage tools. If coverage doesn't increase, the test is wrong regardless of pass/fail.

ยทยทยท

The goal is to work with the model's mechanics, not against them.