# The Null Hypothesis

> Front-end, LLM tooling, and what actually works.

Author: Kumak
Site: https://kumak.dev

---

## Adding llms.txt to Your Astro Blog

URL: https://kumak.dev/adding-llms-txt-to-astro/
Published: 2025-11-30
Category: tutorial

> How to implement the llms.txt standard in Astro without dependencies. Three endpoints, ~150 lines of TypeScript, and your content becomes agent-accessible.

## What is llms.txt?

The [llms.txt specification](https://llmstxt.org/) proposes a standard location for LLM-readable content. Think of it like `robots.txt` for crawlers or `sitemap.xml` for search engines, but designed for AI agents.

The problem it solves: when an AI agent visits your website, it has to parse HTML, navigate around headers, footers, and sidebars, and extract the actual content. This wastes tokens and often produces messy results.

The solution: provide a clean, structured text file at a known location. Agents fetch `/llms.txt`, get a table of contents, and can request individual pieces of content in plain markdown.

## The Architecture

We'll build three endpoints that work together:

```
/llms.txt          → Index: "Here's what I have"
/llms-full.txt     → Everything: "Here's all of it at once"
/llms/[slug].txt   → Individual: "Here's just this one post"
```

**Why three?** Different agents have different needs:

- A quick lookup might only need the index to find one relevant post
- A RAG system might want everything in one request
- A focused query might want just one article without the overhead of the full dump

## File Structure

Here's where everything lives in your Astro project:

```
src/
├── utils/
│   └── llms.ts           # All the generation logic
├── pages/
│   ├── llms.txt.ts       # Index endpoint
│   ├── llms-full.txt.ts  # Full content endpoint
│   └── llms/
│       └── [slug].txt.ts # Per-post endpoints (dynamic route)
```

The `utils/llms.ts` file contains all the logic. The page files are thin wrappers that call into it. This separation keeps the endpoints clean and the logic testable.
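Before building the producer side, it helps to see what a consumer does with the index. A minimal sketch of agent-side parsing (the `parseIndex` helper and sample text are illustrative; the link format follows the spec's `- [title](url): description` convention used throughout this post):

```typescript
// Hypothetical agent-side sketch: parse an llms.txt index into entries.
const sample = [
  "# The Null Hypothesis",
  "",
  "> Front-end, LLM tooling, and what actually works.",
  "",
  "## Posts",
  "- [Adding llms.txt to Your Astro Blog](https://kumak.dev/llms/adding-llms-txt-to-astro.txt): How to implement the llms.txt standard in Astro.",
].join("\n");

interface IndexEntry {
  title: string;
  url: string;
  description: string;
}

function parseIndex(text: string): IndexEntry[] {
  const entries: IndexEntry[] = [];
  // Match "- [title](url): description" lines
  const pattern = /^- \[(.+?)\]\((.+?)\):\s*(.+)$/gm;
  for (const m of text.matchAll(pattern)) {
    entries.push({ title: m[1], url: m[2], description: m[3] });
  }
  return entries;
}

const entries = parseIndex(sample);
console.log(entries[0].url); // the agent can now fetch this .txt directly
```

Once the index is parsed, the agent picks the relevant entry and fetches its `.txt` URL — no HTML involved at any step.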
## Prerequisites

Before we start, you'll need these project-specific pieces:

- **`siteConfig`** - An object with `name`, `description`, `url`, and `author` properties
- **`getAllPosts()`** - A function that returns your content collection posts
- **`BlogPost`** - The type from Astro's content collections with `slug`, `body`, and `data`

The [complete gist](https://gist.github.com/szymdzum/a6db6ff5feb0c566cbd852e10c0ab0af) shows the full implementation with all type definitions.

## Part 1: Type Definitions

Let's start by defining the shapes of our data. Good types make the rest of the code self-documenting.

```typescript
// src/utils/llms.ts

// Basic item for the index - just enough to create a link
interface LlmsItem {
  title: string;
  description: string;
  link: string;
}

// Extended item for full content - includes the actual post data
interface LlmsFullItem extends LlmsItem {
  pubDate: Date;
  category: string;
  body: string;
}
```

Why two types? The index only needs titles and links. The full content dump needs everything. By extending `LlmsItem`, we ensure consistency while allowing the richer type where needed.

Now the configuration types for each generator:

```typescript
// Config for the index endpoint
interface LlmsTxtConfig {
  name: string;
  description: string;
  site: string;
  items: LlmsItem[];
  optional?: LlmsItem[]; // Links that agents can skip if context is tight
}

// Config for the full content endpoint
interface LlmsFullTxtConfig {
  name: string;
  description: string;
  author: string;
  site: string;
  items: LlmsFullItem[];
}

// Config for individual post endpoints
interface LlmsPostConfig {
  post: BlogPost;
  site: string;
  link: string;
}
```

The `optional` field in `LlmsTxtConfig` is part of the spec. It signals to agents: "these links are nice-to-have, skip them if you're running low on context window."

## Part 2: The Document Builder

Every endpoint needs to return a plain text `Response`.
Instead of repeating this logic, we create one builder that handles it all:

```typescript
function doc(...sections: (string | string[])[]): Response {
  const content = sections
    .flat()                      // Flatten nested arrays
    .join("\n")                  // Join with newlines
    .replace(/\n{3,}/g, "\n\n")  // Normalize multiple blank lines to just one
    .trim();                     // Clean up edges

  return new Response(content + "\n", {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

**Why rest parameters with arrays?** This lets us compose documents flexibly:

```typescript
// These all work:
doc("# Title", "Some text");
doc(["# Title", "", "Some text"]);
doc(headerArray, bodyArray, footerArray);
```

**Why normalize newlines?** When composing from multiple arrays, you might accidentally get three or four blank lines in a row. The regex `/\n{3,}/g` catches any run of 3+ newlines and replaces it with exactly 2 (one blank line). Clean output, no matter how messy the input.

## Part 3: Helper Functions

Small, focused functions that each do one thing:

### Formatting Dates

```typescript
function formatDate(date: Date): string {
  return date.toISOString().split("T")[0];
}
```

Takes a Date, returns `"2025-11-30"`. The `split("T")[0]` trick extracts just the date part from an ISO string like `"2025-11-30T00:00:00.000Z"`.

### Building Headers

```typescript
function header(name: string, description: string): string[] {
  return [`# ${name}`, "", `> ${description}`];
}
```

Returns an array of lines. The empty string creates a blank line between the title and the blockquote description. This matches the llms.txt spec format.

### Building Link Lists

```typescript
function linkList(title: string, items: LlmsItem[], site: string): string[] {
  return [
    "",
    `## ${title}`,
    ...items.map((item) => `- [${item.title}](${site}${item.link}): ${item.description}`),
  ];
}
```

Creates a section with an H2 heading and a markdown list of links. Each link includes a description after the colon.
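To see how the pieces compose, here's the same pipeline run as plain strings, reusing the `header()` and `linkList()` definitions from above plus `doc()`'s normalization step (sample site and post are made up):

```typescript
// Same helpers as above, inlined so this snippet is self-contained.
function header(name: string, description: string): string[] {
  return [`# ${name}`, "", `> ${description}`];
}

interface LlmsItem { title: string; description: string; link: string }

function linkList(title: string, items: LlmsItem[], site: string): string[] {
  return [
    "",
    `## ${title}`,
    ...items.map((item) => `- [${item.title}](${site}${item.link}): ${item.description}`),
  ];
}

const sections = [
  header("My Blog", "Notes on things."),
  linkList("Posts", [{ title: "Hello", description: "First post", link: "/llms/hello.txt" }], "https://example.com"),
];

// The same flatten → join → normalize → trim steps doc() performs:
const content = sections
  .flat()
  .join("\n")
  .replace(/\n{3,}/g, "\n\n")
  .trim();

console.log(content);
// # My Blog
//
// > Notes on things.
//
// ## Posts
// - [Hello](https://example.com/llms/hello.txt): First post
```

Note how the blank line before `## Posts` comes from `linkList()`'s leading empty string, and the normalization step guarantees it never doubles up.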
The leading empty string ensures a blank line before the section.

### Building Post Metadata

```typescript
function postMeta(site: string, link: string, pubDate: Date, category: string): string[] {
  return [`URL: ${site}${link}`, `Published: ${formatDate(pubDate)}`, `Category: ${category}`];
}
```

Three lines of metadata for each post. This keeps the format consistent across the full dump and individual post endpoints.

## Part 4: Stripping MDX Syntax

If you use MDX, your post bodies contain things agents don't need, like this (the `Callout` component is illustrative - any import plus PascalCase JSX will do):

```mdx
import Callout from "../components/Callout.astro";

<Callout type="info">An aside rendered for human readers.</Callout>

The actual content starts here...
```

We need to strip the import and the JSX component, but keep the markdown content:

```typescript
const MDX_PATTERNS = [
  /^import\s+.+from\s+['"].+['"];?\s*$/gm,             // import statements
  /<[A-Z][a-zA-Z]*[^>]*>[\s\S]*?<\/[A-Z][a-zA-Z]*>/g,  // JSX blocks like <Callout>...</Callout>
  /<[A-Z][a-zA-Z]*[^>]*\/>/g,                          // Self-closing JSX like <Demo />
] as const;

function stripMdx(content: string): string {
  return MDX_PATTERNS.reduce((text, pattern) => text.replace(pattern, ""), content).trim();
}
```

**How the patterns work:**

1. **Import pattern**: Matches lines starting with `import`, followed by anything, then `from` and a quoted path. The `m` flag makes `^` match line starts.
2. **JSX block pattern**: Matches an opening PascalCase tag, everything inside (lazily, so it stops at the first matching close), and the closing tag.
3. **Self-closing pattern**: Matches `<Component />` tags for components without children.

**Why PascalCase?** JSX components use PascalCase by convention. HTML elements are lowercase. So `<Callout>` gets stripped, but `<div>` or `<code>` passes through. This also means code examples in fenced blocks are safe, since they're not parsed as actual JSX.

## Part 5: The Generators

Now we combine everything into the three main functions:

### Index Generator

```typescript
export function llmsTxt(config: LlmsTxtConfig): Response {
  const sections = [
    header(config.name, config.description),
    linkList("Posts", config.items, config.site),
  ];

  if (config.optional?.length) {
    sections.push(linkList("Optional", config.optional, config.site));
  }

  return doc(...sections);
}
```

Builds an array of sections, conditionally adds the optional section, then passes everything to `doc()`. The spread operator `...sections` unpacks the array into separate arguments.

**Output looks like:**

```markdown
# Site Name

> Site description

## Posts
- [Post Title](https://site.com/llms/post-slug.txt): Post description

## Optional
- [About](https://site.com/about): About the author
```

### Full Content Generator

```typescript
export function llmsFullTxt(config: LlmsFullTxtConfig): Response {
  const head = [
    ...header(config.name, config.description),
    "",
    `Author: ${config.author}`,
    `Site: ${config.site}`,
    "",
    "---",
  ];

  const posts = config.items.flatMap((item) => [
    "",
    `## ${item.title}`,
    "",
    ...postMeta(config.site, item.link, item.pubDate, item.category),
    "",
    `> ${item.description}`,
    "",
    stripMdx(item.body),
    "",
    "---",
  ]);

  return doc(head, posts);
}
```

**Why `flatMap`?** Each item produces an array of lines. Using `map` would give us an array of arrays. `flatMap` maps and flattens in one step, giving us a single array of all lines.

The horizontal rules (`---`) separate posts visually and give agents clear boundaries between content pieces.
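The `map`-vs-`flatMap` difference is easy to see with toy data:

```typescript
const items = [
  { title: "Post A", body: "Alpha." },
  { title: "Post B", body: "Beta." },
];

// map: each item becomes a string[], so the result is string[][]
const nested = items.map((item) => [`## ${item.title}`, item.body, "---"]);

// flatMap: maps and flattens in one step, so the result is string[]
const flat = items.flatMap((item) => [`## ${item.title}`, item.body, "---"]);

console.log(nested.length); // 2 (two inner arrays)
console.log(flat.length);   // 6 (six lines, ready for join("\n"))
```

With `map` you'd need an extra `.flat()` call before joining; `flatMap` gives `doc()` exactly the flat list of lines it expects.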
### Individual Post Generator

```typescript
export function llmsPost(config: LlmsPostConfig): Response {
  const { post, site, link } = config;
  const { title, description, pubDate, category } = post.data;

  return doc(
    `# ${title}`,
    "",
    `> ${description}`,
    "",
    ...postMeta(site, link, pubDate, category),
    "",
    stripMdx(post.body ?? ""),
  );
}
```

The simplest generator. Destructures the config and post data, then builds a single document. The `post.body ?? ""` handles the edge case of a post without body content.

## Part 6: Data Transformers

We need functions to convert Astro's content collection format into our types:

```typescript
export function postsToLlmsItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsItem[] {
  return posts.map((post) => ({
    title: post.data.title,
    description: post.data.description,
    link: formatUrl(post.slug),
  }));
}

export function postsToLlmsFullItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsFullItem[] {
  return posts.map((post) => ({
    ...postsToLlmsItems([post], formatUrl)[0],
    pubDate: post.data.pubDate,
    category: post.data.category,
    body: post.body ?? "",
  }));
}
```

**Why the callback for URLs?** Different endpoints need different URL formats:

- Index links to `/llms/post-slug.txt` (the plain text version)
- Full content links to `/post-slug` (the HTML version)

By passing the formatter as a callback, the same transformer works for both cases.

**Why does `postsToLlmsFullItems` call `postsToLlmsItems`?** DRY principle. The full item includes everything from the basic item, plus extra fields. Instead of duplicating the mapping logic, we reuse it and spread the result.

## Part 7: The Endpoints

Now we wire everything up in Astro page files. These are intentionally thin.
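Before wiring them up, the Part 6 transformers can be sanity-checked in isolation. A sketch with a minimal stand-in for `BlogPost` (just enough shape to exercise the callback; the real type comes from Astro's content collections):

```typescript
// Minimal stand-in for the content-collection post shape.
interface BlogPost {
  slug: string;
  body?: string;
  data: { title: string; description: string };
}

interface LlmsItem { title: string; description: string; link: string }

// Same transformer as Part 6, inlined for a self-contained example.
function postsToLlmsItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsItem[] {
  return posts.map((post) => ({
    title: post.data.title,
    description: post.data.description,
    link: formatUrl(post.slug),
  }));
}

const posts: BlogPost[] = [
  { slug: "hello-world", data: { title: "Hello", description: "First post" } },
];

// The index wants the plain-text version; the full dump wants the HTML page.
const txtLinks = postsToLlmsItems(posts, (slug) => `/llms/${slug}.txt`);
const htmlLinks = postsToLlmsItems(posts, (slug) => `/${slug}`);

console.log(txtLinks[0].link);  // "/llms/hello-world.txt"
console.log(htmlLinks[0].link); // "/hello-world"
```

One transformer, two link styles: exactly the two formats the endpoints below pass in.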
### Index Endpoint

```typescript
// src/pages/llms.txt.ts
import type { APIRoute } from "astro";
import { siteConfig } from "../config";        // your site config (path may differ)
import { getAllPosts } from "../utils/posts";  // your content helper (path may differ)
import { llmsTxt, postsToLlmsItems } from "../utils/llms";

export const GET: APIRoute = async () => {
  const posts = await getAllPosts();

  return llmsTxt({
    name: siteConfig.name,
    description: siteConfig.description,
    site: siteConfig.url,
    items: postsToLlmsItems(posts, (slug) => `/llms/${slug}.txt`),
    optional: [
      { title: "About", link: "/about", description: "About the author" },
      { title: "Full Content", link: "/llms-full.txt", description: "All posts in one file" },
    ],
  });
};
```

The `APIRoute` type tells Astro this is an API endpoint, not an HTML page. The `.txt.ts` filename means it generates `/llms.txt`.

### Full Content Endpoint

```typescript
// src/pages/llms-full.txt.ts
import type { APIRoute } from "astro";
import { siteConfig } from "../config";        // your site config (path may differ)
import { getAllPosts } from "../utils/posts";  // your content helper (path may differ)
import { llmsFullTxt, postsToLlmsFullItems } from "../utils/llms";

export const GET: APIRoute = async () => {
  const posts = await getAllPosts();

  return llmsFullTxt({
    name: siteConfig.name,
    description: siteConfig.description,
    author: siteConfig.author,
    site: siteConfig.url,
    items: postsToLlmsFullItems(posts, (slug) => `/${slug}`),
  });
};
```

Almost identical structure. The URL formatter now points to HTML pages since agents reading the full dump might want to reference the original.

### Dynamic Per-Post Endpoints

```typescript
// src/pages/llms/[slug].txt.ts
import type { GetStaticPaths } from "astro";
import { siteConfig } from "../../config";                       // path may differ
import { getAllPosts, type BlogPost } from "../../utils/posts";  // path may differ
import { llmsPost } from "../../utils/llms";

export const getStaticPaths: GetStaticPaths = async () => {
  const posts = await getAllPosts();

  return posts.map((post) => ({
    params: { slug: post.slug },
    props: { post },
  }));
};

export const GET = ({ props }: { props: { post: BlogPost } }) => {
  return llmsPost({
    post: props.post,
    site: siteConfig.url,
    link: `/${props.post.slug}`,
  });
};
```

**What's `getStaticPaths`?** Astro needs to know at build time which pages to generate. This function returns an array of all valid slugs. Each entry includes `params` (the URL parameters) and `props` (data passed to the page).

**Why `[slug]` in the filename?** Square brackets denote a dynamic route in Astro. The file `[slug].txt.ts` generates `/llms/post-one.txt`, `/llms/post-two.txt`, etc.

## Part 8: Discovery

Agents need to find your llms.txt.
The spec says to put it at the root (`/llms.txt`), similar to `robots.txt`. But you can also advertise it in HTML:

```html
<!-- One possible form - there's no official rel value for this yet -->
<link rel="llms.txt" type="text/plain" href="/llms.txt" />
```

Add this to your base layout or head component, wherever you define other `<link>` tags like RSS or favicon. This isn't part of the official spec, but follows web conventions.

You can also register your site on directories like [llmstxt.site](https://llmstxt.site).

## Limitations

This implementation works for **content collections with markdown or MDX bodies**. It reads `post.body` directly, which is raw text.

For component-based pages (React, Vue, Svelte, or plain `.astro` files), there's no markdown body to extract. You'd need a different strategy:

- Render to HTML and strip tags (lossy, messy)
- Maintain separate content files (duplicate effort)
- Use a headless CMS where content exists independently

For most blogs, content collections are the right choice anyway.

## Why Not Use a Library?

There are Astro integrations for llms.txt. They auto-generate from all pages at build time. Sounds convenient, but:

1. You get everything, including pages you might not want exposed
2. No per-post endpoints
3. No control over the output format
4. Another dependency to maintain

This implementation is ~150 lines of TypeScript. You control exactly what's included. You understand every line. For something this simple, the DIY approach wins.

## Bonus: An SVG Icon

The llms.txt logo is four rounded squares in a plus pattern. Here's a simple SVG version you can use in your navigation (a reconstruction - adjust the rect positions to taste):

```html
<svg viewBox="-4 -4 32 32" width="24" height="24" fill="currentColor">
  <rect x="8"  y="0"  width="8" height="8" rx="2" opacity="0.6" />
  <rect x="0"  y="8"  width="8" height="8" rx="2" opacity="0.7" />
  <rect x="16" y="8"  width="8" height="8" rx="2" opacity="0.8" />
  <rect x="8"  y="16" width="8" height="8" rx="2" opacity="1" />
</svg>
```

**Design notes:**

- **`viewBox="-4 -4 32 32"`** adds padding so the icon matches the visual weight of stroke-based icons like Lucide
- **`fill="currentColor"`** inherits from CSS, so it works with any color scheme
- **Varying opacity** (0.6, 0.7, 0.8, 1.0) gives depth without using multiple colors
- **`rx="2"`** rounds the corners to match the original logo style

For Astro, wrap it in a component so you can pass `size` as a prop and reuse it across your site.
## The Result

After deploying, you have:

- `/llms.txt` - Index listing all posts with descriptions
- `/llms-full.txt` - Complete content for RAG systems or full context
- `/llms/post-slug.txt` - Individual posts for focused queries

Agents fetch the index, pick what they need, and get clean markdown. No HTML parsing, no navigation noise, no wasted tokens. That's the point of the standard.

---

## Testing in the Age of AI Agents

URL: https://kumak.dev/testing-in-the-age-of-ai-agents/
Published: 2025-11-29
Category: philosophy

> When code changes at the speed of thought, tests become less about verification and more about defining what should remain stable.

AI agents don't just write code faster. They make rewriting trivial. Your codebase becomes fluid, reshaping itself as fast as you can describe what you want. But when everything is in flux, how do you know the features still work?

Something needs to hold the shape while everything inside it moves. That something is your tests. Not tests that document how the code works today, but tests that define what it must always do.

## The Contract Principle

The obvious purpose of tests is "catching bugs." But that's incomplete. Tests define what "correct" means. They're a contract: this is what the system must do. Everything else is negotiable.

Kent Beck captured this precisely: tests should be "sensitive to behaviour changes and insensitive to structure changes." A test that breaks when behaviour changes is valuable. A test that breaks when implementation changes, but behaviour stays the same, is actively punishing you for improving your code.
The difference is stark in practice (the `NavItem` props here are illustrative):

```typescript
// Testing implementation - breaks when you refactor
test('calls navItemVariants with correct params', () => {
  const spy = vi.spyOn(styles, 'navItemVariants');
  render(<NavItem to="/orders" label="Orders" />);
  expect(spy).toHaveBeenCalledWith({ active: false });
});

// Testing contract - survives any rewrite
test('renders as a link to the specified route', () => {
  render(<NavItem to="/orders" label="Orders" />);
  const link = screen.getByRole('link', { name: /orders/i });
  expect(link).toHaveAttribute('href', '/orders');
});
```

The first test knows the component uses a function called `navItemVariants`. Tomorrow, you might rename that function or eliminate it entirely. The test breaks. The component still works.

The second test knows only what matters: there's a link, it goes to `/orders`, it says "Orders." Rewrite the entire component. Swap the styling system. As long as users can click a link to their orders, the test passes.

```
$ npm test

❌ FAIL src/components/NavItem.test.tsx
  ✕ calls navItemVariants with correct params
  ✕ passes active prop to styling function
  ✕ renders with expected className

3 tests failed. The component works perfectly. 🙃
```

The tests haven't caught a bug. The behaviour is identical. You're just paying a tax on change.

## The Black Box

Treat every module like a black box. You know what goes in. You know what should come out. What happens inside is none of your tests' business.

This clarifies what to mock. External systems (APIs, databases, third-party services) exist outside your black box. Mock those. Your own modules exist inside. Don't mock those; let them run.

```typescript
// Mock external systems - they're outside your control
vi.mock('~/api/client', () => ({
  fetchUser: vi.fn().mockResolvedValue({ name: 'Test User' }),
}));

// Don't mock your own code - let it run
// ❌ vi.mock('~/components/ui/NavItem');
// ❌ vi.spyOn(myModule, 'internalHelper');
```

When you mock your own code, you're encoding the current implementation into your tests.
When the implementation changes, your mocks become lies. They describe a structure that no longer exists, and your tests pass while your code breaks.

A useful heuristic: before committing a test, imagine handing the module's specification to a developer who'd never seen your code. They implement it from scratch, differently. Would your tests pass? If yes, you've tested the contract. If no, go back and fix the test.

## The Circular Verification Problem

Here's where AI changes everything. Tests exist to verify that code is correct. If AI writes both the code and the tests, what verifies what?

The test was supposed to catch AI mistakes. But AI wrote the test. You've created a loop with no external reference point.

> AI writes code → AI writes tests → tests pass → "correct"?

Black box tests break this circularity because they're human-auditable. When a test says "there's a link that goes to `/orders`," you can read that assertion and verify it matches the requirement. You don't need to understand implementation details.

Implementation-coupled tests aren't auditable this way. To verify the test is correct, you'd need to understand the implementation it's coupled to. You're back to trusting AI about AI's work.

This suggests specific rules:

**Treat assertions as immutable.** AI can refactor how a test runs: the setup, the helpers, the structure. AI should not change what a test asserts without explicit human approval. The assertion is the contract.

```typescript
// AI can change this (setup)
const user = await setupTestUser({ role: 'admin' });

// AI should NOT change this (assertion) without approval
expect(user.canAccessDashboard()).toBe(true);
```

**Failing behaviour tests require human attention.** When a contract-level test fails, AI shouldn't auto-fix it. The failure is information. A human must decide: is this a real bug, or did requirements change?

**Separate creation from modification.** AI drafting new tests for new features is relatively safe.
AI modifying existing tests is riskier. New tests add coverage. Modified tests might silently remove it.

## What Not to Test

Simple, obvious code doesn't need tests. A component that renders a string as a heading doesn't need a test proving it renders a heading. A utility that concatenates paths doesn't need a test for every combination.

Test complex logic. Test edge cases. Test error handling. Test anything where a bug would be non-obvious or expensive to find later.

```typescript
// Congratulations, you've tested JavaScript
test('banana equals banana', () => {
  expect('🍌').toBe('🍌'); // ✅ PASS
});
```

Don't test that React renders React components. Don't test that TypeScript types are correct. Your test suite isn't a proof of correctness; it's a net that catches bugs that matter.

This restraint has a benefit: a smaller, focused test suite is easier to audit. When every test has a clear purpose, you can review what AI wrote and verify it matches intent.

## The Coverage Trap

Coverage measures execution, not intent. A test that executes a line of code isn't necessarily testing that the line does what it should.

Worse, coverage as a target incentivises exactly the wrong kind of tests. Need to hit 80%? Write tests that spy on every function, assert on every intermediate value. You'll hit your number. You'll also create a test suite that breaks whenever anyone improves the code.

```typescript
// Written for coverage, not for value
test('increases coverage', () => {
  const result = processOrder(mockOrder);
  expect(processOrder).toHaveBeenCalled(); // So what?
  expect(result).toBeDefined();            // Still nothing
});

// Written for behaviour
test('completed orders update inventory', () => {
  const order = createOrder({ items: [{ sku: 'ABC', quantity: 2 }] });
  processOrder(order);
  expect(getInventory('ABC')).toBe(initialStock - 2);
});
```

The real question isn't "how much code did my tests execute?" It's "would my tests catch a bug that matters?"
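A contrived illustration of why execution isn't verification: both checks below run every line of `applyDiscount` (100% coverage either way), but only the behaviour check notices the bug. Names and numbers are hypothetical:

```typescript
// Deliberately buggy: the discount is applied twice.
function applyDiscount(price: number, percent: number): number {
  const discounted = price - price * (percent / 100);
  return discounted - discounted * (percent / 100); // bug: second application
}

const result = applyDiscount(100, 10);

// "Coverage" check: executes the function, asserts almost nothing. Passes.
const coveragePasses = result !== undefined;

// Behaviour check: 10% off 100 should be 90. Fails, because result is 81.
const behaviourPasses = result === 90;

console.log({ result, coveragePasses, behaviourPasses });
// { result: 81, coveragePasses: true, behaviourPasses: false }
```

Same lines executed, opposite verdicts. Coverage counts the first kind of check exactly as highly as the second.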
## A Philosophy for Flux

Tests are how you know code is correct. When both code and tests are fluid, when AI can change either at will, you lose the ability to verify anything. The test that passed yesterday means nothing if it was rewritten to match today's code.

The philosophy is simple:

> Test what the code does, not how it does it.

Tests become specifications, not surveillance. They define what matters, not document what exists. And because they encode observable behaviour rather than internal structure, they remain human-auditable even when AI writes them.

When code is in constant flux, tests are your fixed point. They're stable not because change is expensive, but because they define what "correct" means. Without that fixed point, you have no way to know if your fluid code is flowing in the right direction.

---

## Self-Documenting CLI Design for LLMs

URL: https://kumak.dev/self-documenting-cli-design-for-llms/
Published: 2025-11-28
Category: philosophy

> Agents start fresh every session. Instead of dumping docs upfront, build tools they can query. One take on agent-friendly tooling.

I'm building a CLI tool for browser debugging. It lets AI agents control Chrome through the DevTools Protocol: capture screenshots, inspect network requests, execute JavaScript.

The Chrome DevTools Protocol has 53 domains and over 600 methods. That's a lot of capability and a lot of documentation. Here's the problem: how do I teach an agent what's possible without dumping thousands of tokens into context every session?

Documentation is a wall of text about things you don't need yet. Worse, it drifts. The tool ships a new version, someone forgets to update the docs, and now the agent is following instructions for a method that was renamed three months ago. The tool and its documentation are two artifacts pretending to be one.

When Claude gets stuck with CLI tools, it naturally reaches for `--help`. When that's not enough, it tries `command subcommand --help`.
The pattern is consistent: ask the tool, learn from the response, try again. If `--help` is the agent's natural discovery method, how far can you push it?

## Progressive Disclosure

Instead of documenting everything upfront, make every layer queryable. Watch the conversation unfold:

```shell
# Agent asks: "What can you do?"
bdg --help --json

# Agent asks: "What domains exist?"
bdg cdp --list

# Agent asks: "What can I do with Network?"
bdg cdp Network --list

# Agent asks: "How do I get cookies?"
bdg cdp Network.getCookies --describe

# Agent executes with confidence
bdg cdp Network.getCookies
```

Each answer reveals exactly what's needed for the next question. Five interactions, zero documentation. The tool taught itself.

When the agent doesn't know the exact method name, semantic search bridges the gap:

```shell
$ bdg cdp --search cookie

Found 14 methods matching "cookie":
  Network.getCookies     # Returns all browser cookies for the current URL
  Network.setCookie      # Sets a cookie with the given cookie data
  Network.deleteCookies  # Deletes browser cookies with matching name
  ...
```

The agent thinks "I need something with cookies" and the tool finds everything relevant. No guessing required.

## Errors That Teach

Actionable error messages have been a UX best practice for decades. What's different for agents is the stakes: humans can work around bad UX by searching Stack Overflow. Agents can't. They're stuck with what you give them, racing against a context window that's always shrinking.

And agents make mistakes constantly. They'll type `Network.getCokies` instead of `Network.getCookies`. They'll invent plausible-sounding methods that don't exist.

A typical error:

```shell
$ bdg cdp Network.getCokies
Error: Method not found
```

Now what? The agent has to guess, search, retry. Burn tokens.
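Suggesting near-matches doesn't require much machinery. A minimal edit-distance sketch (illustrative only, not bdg's actual implementation; the method list is truncated for the example):

```typescript
// Classic Levenshtein distance via dynamic programming.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

const known = ["Network.getCookies", "Network.setCookie", "Network.deleteCookies"];

// Rank known methods by distance; keep only plausible corrections.
function suggest(input: string, maxDistance = 3): string[] {
  return known
    .map((method) => ({ method, d: editDistance(input, method) }))
    .filter(({ d }) => d <= maxDistance)
    .sort((x, y) => x.d - y.d)
    .map(({ method }) => method);
}

console.log(suggest("Network.getCokies")); // nearest methods first
```

Ranking the known method list by distance to the bad input is enough to catch most typos, and the cost is trivial next to a failed round trip.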
Teaching errors provide the path forward:

```shell
$ bdg cdp Network.getCokies
Error: Method 'Network.getCokies' not found

Did you mean:
  - Network.getCookies
  - Network.setCookies
  - Network.setCookie
```

The correction arrives in the same response as the error. No round trip. The agent adapts immediately.

The fuzzy matching goes beyond typos. Try `Networking.getCookies` with the wrong domain name, and it still suggests `Network.getCookies`. The tool understands what you meant, not just what you typed.

Even empty results guide forward:

```shell
$ bdg dom query "article h2"
No nodes found matching "article h2"

Suggestions:
  Verify: bdg dom eval "document.querySelector('article h2')"
  List:   bdg dom query "*"
```

And success states show next steps:

```shell
$ bdg dom query "h1, h2, h3"
Found 5 nodes:
  [0] <h2> Recent Posts
  [1] <h2> Testing in the Age of AI Agents
  ...

Next steps:
  Get HTML: bdg dom get 0
  Details:  bdg details dom 0
```

Every interaction answers "what now?" Errors suggest fixes. Empty results suggest alternatives. Success shows what to do with the data.

## Semantic Exit Codes

Most tools return 1 for any error. Not helpful. Semantic exit codes create ranges with meaning:

- **80-89**: User errors. Bad input, fix it before retrying.
- **100-109**: External errors. API timeout, retry with backoff.

The agent can branch its logic without parsing error messages. Message, suggestion, exit code: three layers of guidance stacked together.

## The Result

I tested this with an agent starting from zero knowledge. No prior context, no documentation provided. Just the tool.

Five commands later, it was executing CDP methods successfully. It discovered the tool's structure, explored the domains, found the method it needed, understood the parameters, and executed.

When I introduced typos deliberately, the suggestions caught them. When commands failed, the exit codes pointed toward solutions. The agent recovered without external help.

The context cost? Roughly 500 tokens for discovery, versus thousands for a documentation dump. And those 500 tokens bought understanding, not just information.

## Design for Dialogue

External documentation will always drift from reality. The tool itself never lies about its own capabilities.

Tools designed for agents aren't dumbed down. They're more explicit. They expose their structure. They teach through interaction rather than requiring upfront reading.

Design for dialogue, not documentation.

---

## CLI vs MCP on Chrome DevTools Protocol

URL: https://kumak.dev/cli-vs-mcp-benchmarking-browser-automation/
Published: 2025-11-23
Category: opinion

> I ran benchmarks comparing CLI tools against MCP servers for browser automation. 13x more token efficient with CLI. Here's what I found.
Anthropic published an observation about [MCP and code execution](https://www.anthropic.com/engineering/code-execution-with-mcp): executable code in the filesystem might be more efficient for AI agents than protocol servers. Isn't that what CLI tools already are?

I tested this by comparing two approaches to browser automation:

- **CLI**: [bdg](https://github.com/szymdzum/browser-debugger-cli), a browser debugger CLI I built
- **MCP**: [Chrome DevTools MCP](https://github.com/ChromeDevTools/chrome-devtools-mcp), the official Chrome DevTools protocol server

Both interact with the Chrome DevTools Protocol. Same underlying capabilities. The question: does the interface matter?

## Methodology

Fresh Claude instance (Sonnet 4.5) with zero prior knowledge of either tool. Identical tasks across three websites: Hacker News (navigation, extraction), CodePen (inspection, screenshots), Amazon (product data, anti-bot stress test). No human guidance.

**Full methodology**: [BENCHMARK_PROMPT.md](https://github.com/szymdzum/browser-debugger-cli/blob/main/docs/benchmarks/BENCHMARK_PROMPT.md)

## Results

### Token Efficiency: 13x Difference

| Tool           | Total Tokens | Per Test Average |
| -------------- | ------------ | ---------------- |
| **bdg (CLI)**  | 6,500        | ~2,200           |
| **Chrome MCP** | 85,500       | ~28,500          |

That's roughly a **13x difference** in bdg's favour.

The gap comes from how each tool returns information:

**MCP**: Full accessibility snapshots. Every page state returns a complete accessibility tree. Amazon product page alone: **52,000 tokens** in one snapshot.

**CLI**: Targeted queries. `bdg dom query ".athing"` returns 30 Hacker News stories in 1,200 tokens.

### Command Count

| Test        | bdg Commands | MCP Calls |
| ----------- | ------------ | --------- |
| Hacker News | 11           | 8         |
| CodePen     | 6            | 5         |
| Amazon      | 4            | 3         |

MCP requires slightly fewer calls. But when one approach uses 13x more tokens per call, command count becomes less relevant.
### Discovery: Zero-Knowledge Learning

The most interesting result was watching how the agent learned each tool.

**bdg discovery path** (5 commands to first successful execution):

```bash
bdg --help --json                      # Agent learns: 10 commands, exit codes, task mappings
bdg cdp --list                         # Agent learns: 53 domains available
bdg cdp Network --list                 # Agent learns: 39 methods in Network domain
bdg cdp --search cookie
bdg cdp Network.getCookies --describe  # Agent discovers parameters, return types, examples
bdg cdp Network.getCookies             # Successful execution
```

Zero external documentation. The tool taught itself through introspection.

**MCP discovery**: Requires understanding the MCP protocol, parsing 10k+ token accessibility trees to find element UIDs, and using UID-based selection from snapshots.

### Element Selection

**bdg**: Standard CSS selectors

```bash
bdg dom query ".athing"        # Hacker News stories
bdg dom query "#productTitle"  # Amazon product
bdg dom click "button[type=submit]"
```

**MCP**: UID-based from accessibility tree

```javascript
take_snapshot({});       // Returns 10k+ tokens
click({ uid: '1_28' });  // Must find UID in snapshot first
```

CSS selectors are familiar; UID-based selection is more robust for dynamic content. Trade-offs exist.

## Why the Gap Exists

**Return payload size**: MCP's accessibility snapshots include everything. CLI queries return only what you ask for. This is the primary driver of the 13x difference.

**Error handling**: CLI failures return structured errors with exit codes and suggestions.
The agent can self-correct:

```bash
$ bdg dom click ".missing-button"
Error: Element not found: .missing-button
Exit code: 81 (user error)
Suggestions:
  - Verify selector: bdg dom query ".missing-button"
  - List all buttons: bdg dom query "button"
```

**Composability**: CLI tools integrate with Unix:

```bash
bdg peek --network | jq '.[] | select(.status >= 400)'
bdg dom query "button" | jq '.[0].nodeId' | xargs bdg dom click
```

If the tool doesn't provide exactly what you need, compose it with `jq`, `grep`, or `awk`. MCP servers require protocol extensions.

**Protocol access**: bdg exposes all 644 CDP methods across 53 domains. MCP servers expose curated subsets. Need `Profiler.startPreciseCoverage`? It's there via `bdg cdp Profiler.startPreciseCoverage`.

## Limitations

- **Small sample**: 3 websites
- **Single model**: Claude Sonnet 4.5 only
- **Specific scenarios**: information extraction workflows
- **Bot detection**: both tools faced blocks on some sites

The token efficiency advantage should apply equally to debugging workflows (console streaming, network inspection, profiling), but I didn't benchmark those specifically.

## Takeaway

For browser automation with AI agents, CLI tools proved 13x more token-efficient than MCP servers accessing the same underlying protocol. The difference comes from targeted queries vs. full snapshots, plus the ability to compose with Unix tools and access the complete CDP surface.

This isn't definitive. It's one data point. Your needs may differ.

**Full results**: [BENCHMARK_RESULTS_2025-11-23.md](https://github.com/szymdzum/browser-debugger-cli/blob/main/docs/benchmarks/BENCHMARK_RESULTS_2025-11-23.md)

---

## How My Agent Learned GitLab

URL: https://kumak.dev/how-my-agent-learned-gitlab/
Published: 2025-11-17
Category: tutorial

> Teaching an agent to use CLI tools isn't about writing perfect documentation. It's about creating a feedback loop where the tool teaches, the agent learns, and reflection builds institutional knowledge.
I work with a monorepo that has over 80 CI/CD jobs across 12 stages. When pipelines fail, I need to trace through parent pipelines, child pipelines, failed jobs, and error logs.

There's an MCP server for GitLab. I tried it once, then installed `glab` and wrote a basic [skill file](https://gist.github.com/szymdzum/304645336c57c53d59a6b7e4ba00a7a6) with command examples.

What's interesting isn't the skill itself. It's how it developed through three investigation sessions.

## Session One: Real-Time Self-Correction

"Investigate pipeline 2961721" was my first request.

Claude ran a command. Got 20 jobs back. The pipeline had 80+. I watched Claude notice the discrepancy, run `glab api --help`, spot the `--paginate` flag, and try again. This time: all the jobs.

Then it pulled logs with `glab ci trace <job-id>`. The logs looked clean. No errors visible. But the job had definitely failed.

I didn't explain what was wrong. I asked: "The job failed, but you're not seeing errors. What might be happening?"

Claude reasoned through it: "Errors might be going to stderr instead of stdout." Then it checked `glab ci trace --help`, found nothing about stderr handling, and figured out the solution: `glab ci trace <job-id> 2>&1`. Reran it. Errors appeared.

**After the session**, I asked: "What went wrong? What did you learn?"

Claude listed the issues: forgot to paginate (only saw 20 of 80+ jobs), missed stderr output, didn't know about child pipelines. We talked through each one, then updated the skill file:

```markdown
## Critical Best Practices

1. **Always use --paginate** for job queries
2. **Always capture stderr** with `2>&1` when getting logs
3. **Always check for child pipelines** via bridges API
4. **Limit log output** — use `tail -100` or `head -50`
```

Twenty minutes of reflection. Four critical lessons documented.

## Session Two: Faster, Smarter

"Check pipeline 2965483."
This time, Claude used `--paginate` from the start, captured stderr when pulling logs, and checked for child pipelines via the bridges API. Found a failed child pipeline, got its jobs, identified the error. Start to finish: five minutes.

But something new happened. All 15 Image build jobs failed. Claude started pulling logs for each one. I watched it fetch the first three — all identical errors. The base Docker image was missing from ECR.

"You just pulled three identical error messages," I pointed out. "What does that tell you?"

Claude recognised the pattern: "When multiple jobs of the same type fail, they likely have the same error. I should check one representative job instead of all 15."

Added to the skill file:

```markdown
## Pattern: Multiple Failed Jobs

When many jobs fail (e.g., all Image builds), check one representative job first.

FIRST_FAILED=$(glab api "projects/2558/pipelines/<pipeline-id>/jobs" --paginate | \
  jq -r '.[] | select(.status == "failed") | .id' | head -1)
glab ci trace $FIRST_FAILED 2>&1 | tail -100
```

## Session Three: Institutional Knowledge

Third investigation. Checkout server build timed out. Claude saw the error, started digging.

"Wait," I said. "Before you investigate, check the duration."

Claude checked: 44 minutes. "That's within normal range for checkout server builds," I told it. "This is a known issue, not an actual failure."

Added to the skill file:

```markdown
## Common Error Patterns

Build Timeout:
ERROR: Job failed: execution took longer than