# The Null Hypothesis

> Front-end, LLM tooling, and what actually works.

Author: Kumak
Site: https://kumak.dev

---

## Adding llms.txt to Your Astro Blog

URL: https://kumak.dev/adding-llms-txt-to-astro/
Published: 2025-11-30
Category: tutorial

> How to implement the llms.txt standard in Astro without dependencies. Three endpoints, ~150 lines of TypeScript, and your content becomes agent-accessible.

## What is llms.txt?

The [llms.txt specification](https://llmstxt.org/) proposes a standard location for LLM-readable content. Think of it like `robots.txt` for crawlers or `sitemap.xml` for search engines, but designed for AI agents.

The problem it solves: when an AI agent visits your website, it has to parse HTML, navigate around headers, footers, and sidebars, and extract the actual content. This wastes tokens and often produces messy results.

The solution: provide a clean, structured text file at a known location. Agents fetch `/llms.txt`, get a table of contents, and can request individual pieces of content in plain markdown.

## The Architecture

We'll build three endpoints that work together:

```
/llms.txt          → Index: "Here's what I have"
/llms-full.txt     → Everything: "Here's all of it at once"
/llms/[slug].txt   → Individual: "Here's just this one post"
```

**Why three?** Different agents have different needs:

- A quick lookup might only need the index to find one relevant post
- A RAG system might want everything in one request
- A focused query might want just one article without the overhead of the full dump

## File Structure

Here's where everything lives in your Astro project:

```
src/
├── utils/
│   └── llms.ts           # All the generation logic
├── pages/
│   ├── llms.txt.ts       # Index endpoint
│   ├── llms-full.txt.ts  # Full content endpoint
│   └── llms/
│       └── [slug].txt.ts # Per-post endpoints (dynamic route)
```

The `utils/llms.ts` file contains all the logic. The page files are thin wrappers that call into it. This separation keeps the endpoints clean and the logic testable.
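Before building the producer side, it helps to see what a consumer does with the index. A minimal sketch of agent-side parsing (the `parseIndex` helper and sample text are illustrative; the link format follows the spec's `- [title](url): description` convention used throughout this post):

```typescript
// Hypothetical agent-side sketch: parse an llms.txt index into entries.
const sample = [
  "# The Null Hypothesis",
  "",
  "> Front-end, LLM tooling, and what actually works.",
  "",
  "## Posts",
  "- [Adding llms.txt to Your Astro Blog](https://kumak.dev/llms/adding-llms-txt-to-astro.txt): How to implement the llms.txt standard in Astro.",
].join("\n");

interface IndexEntry {
  title: string;
  url: string;
  description: string;
}

function parseIndex(text: string): IndexEntry[] {
  const entries: IndexEntry[] = [];
  // Match "- [title](url): description" lines
  const pattern = /^- \[(.+?)\]\((.+?)\):\s*(.+)$/gm;
  for (const m of text.matchAll(pattern)) {
    entries.push({ title: m[1], url: m[2], description: m[3] });
  }
  return entries;
}

const entries = parseIndex(sample);
console.log(entries[0].url); // the agent can now fetch this .txt directly
```

Once the index is parsed, the agent picks the relevant entry and fetches its `.txt` URL — no HTML involved at any step.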
## Prerequisites

Before we start, you'll need these project-specific pieces:

- **`siteConfig`** - An object with `name`, `description`, `url`, and `author` properties
- **`getAllPosts()`** - A function that returns your content collection posts
- **`BlogPost`** - The type from Astro's content collections with `slug`, `body`, and `data`

The [complete gist](https://gist.github.com/szymdzum/a6db6ff5feb0c566cbd852e10c0ab0af) shows the full implementation with all type definitions.

## Part 1: Type Definitions

Let's start by defining the shapes of our data. Good types make the rest of the code self-documenting.

```typescript
// src/utils/llms.ts

// Basic item for the index - just enough to create a link
interface LlmsItem {
  title: string;
  description: string;
  link: string;
}

// Extended item for full content - includes the actual post data
interface LlmsFullItem extends LlmsItem {
  pubDate: Date;
  category: string;
  body: string;
}
```

Why two types? The index only needs titles and links. The full content dump needs everything. By extending `LlmsItem`, we ensure consistency while allowing the richer type where needed.

Now the configuration types for each generator:

```typescript
// Config for the index endpoint
interface LlmsTxtConfig {
  name: string;
  description: string;
  site: string;
  items: LlmsItem[];
  optional?: LlmsItem[]; // Links that agents can skip if context is tight
}

// Config for the full content endpoint
interface LlmsFullTxtConfig {
  name: string;
  description: string;
  author: string;
  site: string;
  items: LlmsFullItem[];
}

// Config for individual post endpoints
interface LlmsPostConfig {
  post: BlogPost;
  site: string;
  link: string;
}
```

The `optional` field in `LlmsTxtConfig` is part of the spec. It signals to agents: "these links are nice-to-have, skip them if you're running low on context window."

## Part 2: The Document Builder

Every endpoint needs to return a plain text `Response`.
Instead of repeating this logic, we create one builder that handles it all:

```typescript
function doc(...sections: (string | string[])[]): Response {
  const content = sections
    .flat()                      // Flatten nested arrays
    .join("\n")                  // Join with newlines
    .replace(/\n{3,}/g, "\n\n")  // Normalize multiple blank lines to just one
    .trim();                     // Clean up edges

  return new Response(content + "\n", {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

**Why rest parameters with arrays?** This lets us compose documents flexibly:

```typescript
// These all work:
doc("# Title", "Some text");
doc(["# Title", "", "Some text"]);
doc(headerArray, bodyArray, footerArray);
```

**Why normalize newlines?** When composing from multiple arrays, you might accidentally get three or four blank lines in a row. The regex `/\n{3,}/g` catches any run of 3+ newlines and replaces it with exactly 2 (one blank line). Clean output, no matter how messy the input.

## Part 3: Helper Functions

Small, focused functions that each do one thing:

### Formatting Dates

```typescript
function formatDate(date: Date): string {
  return date.toISOString().split("T")[0];
}
```

Takes a Date, returns `"2025-11-30"`. The `split("T")[0]` trick extracts just the date part from an ISO string like `"2025-11-30T00:00:00.000Z"`.

### Building Headers

```typescript
function header(name: string, description: string): string[] {
  return [`# ${name}`, "", `> ${description}`];
}
```

Returns an array of lines. The empty string creates a blank line between the title and the blockquote description. This matches the llms.txt spec format.

### Building Link Lists

```typescript
function linkList(title: string, items: LlmsItem[], site: string): string[] {
  return [
    "",
    `## ${title}`,
    ...items.map((item) => `- [${item.title}](${site}${item.link}): ${item.description}`),
  ];
}
```

Creates a section with an H2 heading and a markdown list of links. Each link includes a description after the colon.
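To see how the pieces compose, here's the same pipeline run as plain strings, reusing the `header()` and `linkList()` definitions from above plus `doc()`'s normalization step (sample site and post are made up):

```typescript
// Same helpers as above, inlined so this snippet is self-contained.
function header(name: string, description: string): string[] {
  return [`# ${name}`, "", `> ${description}`];
}

interface LlmsItem { title: string; description: string; link: string }

function linkList(title: string, items: LlmsItem[], site: string): string[] {
  return [
    "",
    `## ${title}`,
    ...items.map((item) => `- [${item.title}](${site}${item.link}): ${item.description}`),
  ];
}

const sections = [
  header("My Blog", "Notes on things."),
  linkList("Posts", [{ title: "Hello", description: "First post", link: "/llms/hello.txt" }], "https://example.com"),
];

// The same flatten → join → normalize → trim steps doc() performs:
const content = sections
  .flat()
  .join("\n")
  .replace(/\n{3,}/g, "\n\n")
  .trim();

console.log(content);
// # My Blog
//
// > Notes on things.
//
// ## Posts
// - [Hello](https://example.com/llms/hello.txt): First post
```

Note how the blank line before `## Posts` comes from `linkList()`'s leading empty string, and the normalization step guarantees it never doubles up.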
The leading empty string ensures a blank line before the section.

### Building Post Metadata

```typescript
function postMeta(site: string, link: string, pubDate: Date, category: string): string[] {
  return [`URL: ${site}${link}`, `Published: ${formatDate(pubDate)}`, `Category: ${category}`];
}
```

Three lines of metadata for each post. This keeps the format consistent across the full dump and individual post endpoints.

## Part 4: Stripping MDX Syntax

If you use MDX, your post bodies contain things agents don't need, like this (the `Callout` component is illustrative - any import plus PascalCase JSX will do):

```mdx
import Callout from "../components/Callout.astro";

<Callout type="info">An aside rendered for human readers.</Callout>

The actual content starts here...
```

We need to strip the import and the JSX component, but keep the markdown content:

```typescript
const MDX_PATTERNS = [
  /^import\s+.+from\s+['"].+['"];?\s*$/gm,             // import statements
  /<[A-Z][a-zA-Z]*[^>]*>[\s\S]*?<\/[A-Z][a-zA-Z]*>/g,  // JSX blocks like <Callout>...</Callout>
  /<[A-Z][a-zA-Z]*[^>]*\/>/g,                          // Self-closing JSX like <Demo />
] as const;

function stripMdx(content: string): string {
  return MDX_PATTERNS.reduce((text, pattern) => text.replace(pattern, ""), content).trim();
}
```

**How the patterns work:**

1. **Import pattern**: Matches lines starting with `import`, followed by anything, then `from` and a quoted path. The `m` flag makes `^` match line starts.
2. **JSX block pattern**: Matches an opening PascalCase tag, everything inside (lazily, so it stops at the first matching close), and the closing tag.
3. **Self-closing pattern**: Matches `<Component />` tags for components without children.

**Why PascalCase?** JSX components use PascalCase by convention. HTML elements are lowercase. So `<Callout>` gets stripped, but `<div>` or `<code>` passes through. This also means code examples in fenced blocks are safe, since they're not parsed as actual JSX.

## Part 5: The Generators

Now we combine everything into the three main functions:

### Index Generator

```typescript
export function llmsTxt(config: LlmsTxtConfig): Response {
  const sections = [
    header(config.name, config.description),
    linkList("Posts", config.items, config.site),
  ];

  if (config.optional?.length) {
    sections.push(linkList("Optional", config.optional, config.site));
  }

  return doc(...sections);
}
```

Builds an array of sections, conditionally adds the optional section, then passes everything to `doc()`. The spread operator `...sections` unpacks the array into separate arguments.

**Output looks like:**

```markdown
# Site Name

> Site description

## Posts
- [Post Title](https://site.com/llms/post-slug.txt): Post description

## Optional
- [About](https://site.com/about): About the author
```

### Full Content Generator

```typescript
export function llmsFullTxt(config: LlmsFullTxtConfig): Response {
  const head = [
    ...header(config.name, config.description),
    "",
    `Author: ${config.author}`,
    `Site: ${config.site}`,
    "",
    "---",
  ];

  const posts = config.items.flatMap((item) => [
    "",
    `## ${item.title}`,
    "",
    ...postMeta(config.site, item.link, item.pubDate, item.category),
    "",
    `> ${item.description}`,
    "",
    stripMdx(item.body),
    "",
    "---",
  ]);

  return doc(head, posts);
}
```

**Why `flatMap`?** Each item produces an array of lines. Using `map` would give us an array of arrays. `flatMap` maps and flattens in one step, giving us a single array of all lines.

The horizontal rules (`---`) separate posts visually and give agents clear boundaries between content pieces.
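The `map`-vs-`flatMap` difference is easy to see with toy data:

```typescript
const items = [
  { title: "Post A", body: "Alpha." },
  { title: "Post B", body: "Beta." },
];

// map: each item becomes a string[], so the result is string[][]
const nested = items.map((item) => [`## ${item.title}`, item.body, "---"]);

// flatMap: maps and flattens in one step, so the result is string[]
const flat = items.flatMap((item) => [`## ${item.title}`, item.body, "---"]);

console.log(nested.length); // 2 (two inner arrays)
console.log(flat.length);   // 6 (six lines, ready for join("\n"))
```

With `map` you'd need an extra `.flat()` call before joining; `flatMap` gives `doc()` exactly the flat list of lines it expects.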
### Individual Post Generator

```typescript
export function llmsPost(config: LlmsPostConfig): Response {
  const { post, site, link } = config;
  const { title, description, pubDate, category } = post.data;

  return doc(
    `# ${title}`,
    "",
    `> ${description}`,
    "",
    ...postMeta(site, link, pubDate, category),
    "",
    stripMdx(post.body ?? ""),
  );
}
```

The simplest generator. Destructures the config and post data, then builds a single document. The `post.body ?? ""` handles the edge case of a post without body content.

## Part 6: Data Transformers

We need functions to convert Astro's content collection format into our types:

```typescript
export function postsToLlmsItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsItem[] {
  return posts.map((post) => ({
    title: post.data.title,
    description: post.data.description,
    link: formatUrl(post.slug),
  }));
}

export function postsToLlmsFullItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsFullItem[] {
  return posts.map((post) => ({
    ...postsToLlmsItems([post], formatUrl)[0],
    pubDate: post.data.pubDate,
    category: post.data.category,
    body: post.body ?? "",
  }));
}
```

**Why the callback for URLs?** Different endpoints need different URL formats:

- Index links to `/llms/post-slug.txt` (the plain text version)
- Full content links to `/post-slug` (the HTML version)

By passing the formatter as a callback, the same transformer works for both cases.

**Why does `postsToLlmsFullItems` call `postsToLlmsItems`?** DRY principle. The full item includes everything from the basic item, plus extra fields. Instead of duplicating the mapping logic, we reuse it and spread the result.

## Part 7: The Endpoints

Now we wire everything up in Astro page files. These are intentionally thin.
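Before wiring them up, the Part 6 transformers can be sanity-checked in isolation. A sketch with a minimal stand-in for `BlogPost` (just enough shape to exercise the callback; the real type comes from Astro's content collections):

```typescript
// Minimal stand-in for the content-collection post shape.
interface BlogPost {
  slug: string;
  body?: string;
  data: { title: string; description: string };
}

interface LlmsItem { title: string; description: string; link: string }

// Same transformer as Part 6, inlined for a self-contained example.
function postsToLlmsItems(
  posts: BlogPost[],
  formatUrl: (slug: string) => string,
): LlmsItem[] {
  return posts.map((post) => ({
    title: post.data.title,
    description: post.data.description,
    link: formatUrl(post.slug),
  }));
}

const posts: BlogPost[] = [
  { slug: "hello-world", data: { title: "Hello", description: "First post" } },
];

// The index wants the plain-text version; the full dump wants the HTML page.
const txtLinks = postsToLlmsItems(posts, (slug) => `/llms/${slug}.txt`);
const htmlLinks = postsToLlmsItems(posts, (slug) => `/${slug}`);

console.log(txtLinks[0].link);  // "/llms/hello-world.txt"
console.log(htmlLinks[0].link); // "/hello-world"
```

One transformer, two link styles: exactly the two formats the endpoints below pass in.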
### Index Endpoint

```typescript
// src/pages/llms.txt.ts
import type { APIRoute } from "astro";
import { siteConfig } from "../config";        // your site config (path may differ)
import { getAllPosts } from "../utils/posts";  // your content helper (path may differ)
import { llmsTxt, postsToLlmsItems } from "../utils/llms";

export const GET: APIRoute = async () => {
  const posts = await getAllPosts();

  return llmsTxt({
    name: siteConfig.name,
    description: siteConfig.description,
    site: siteConfig.url,
    items: postsToLlmsItems(posts, (slug) => `/llms/${slug}.txt`),
    optional: [
      { title: "About", link: "/about", description: "About the author" },
      { title: "Full Content", link: "/llms-full.txt", description: "All posts in one file" },
    ],
  });
};
```

The `APIRoute` type tells Astro this is an API endpoint, not an HTML page. The `.txt.ts` filename means it generates `/llms.txt`.

### Full Content Endpoint

```typescript
// src/pages/llms-full.txt.ts
import type { APIRoute } from "astro";
import { siteConfig } from "../config";        // your site config (path may differ)
import { getAllPosts } from "../utils/posts";  // your content helper (path may differ)
import { llmsFullTxt, postsToLlmsFullItems } from "../utils/llms";

export const GET: APIRoute = async () => {
  const posts = await getAllPosts();

  return llmsFullTxt({
    name: siteConfig.name,
    description: siteConfig.description,
    author: siteConfig.author,
    site: siteConfig.url,
    items: postsToLlmsFullItems(posts, (slug) => `/${slug}`),
  });
};
```

Almost identical structure. The URL formatter now points to HTML pages since agents reading the full dump might want to reference the original.

### Dynamic Per-Post Endpoints

```typescript
// src/pages/llms/[slug].txt.ts
import type { GetStaticPaths } from "astro";
import { siteConfig } from "../../config";                       // path may differ
import { getAllPosts, type BlogPost } from "../../utils/posts";  // path may differ
import { llmsPost } from "../../utils/llms";

export const getStaticPaths: GetStaticPaths = async () => {
  const posts = await getAllPosts();

  return posts.map((post) => ({
    params: { slug: post.slug },
    props: { post },
  }));
};

export const GET = ({ props }: { props: { post: BlogPost } }) => {
  return llmsPost({
    post: props.post,
    site: siteConfig.url,
    link: `/${props.post.slug}`,
  });
};
```

**What's `getStaticPaths`?** Astro needs to know at build time which pages to generate. This function returns an array of all valid slugs. Each entry includes `params` (the URL parameters) and `props` (data passed to the page).

**Why `[slug]` in the filename?** Square brackets denote a dynamic route in Astro. The file `[slug].txt.ts` generates `/llms/post-one.txt`, `/llms/post-two.txt`, etc.

## Part 8: Discovery

Agents need to find your llms.txt.
The spec says to put it at the root (`/llms.txt`), similar to `robots.txt`. But you can also advertise it in HTML:

```html
<!-- One possible form - there's no official rel value for this yet -->
<link rel="llms.txt" type="text/plain" href="/llms.txt" />
```

Add this to your base layout or head component, wherever you define other `<link>` tags like RSS or favicon. This isn't part of the official spec, but follows web conventions.

You can also register your site on directories like [llmstxt.site](https://llmstxt.site).

## Limitations

This implementation works for **content collections with markdown or MDX bodies**. It reads `post.body` directly, which is raw text.

For component-based pages (React, Vue, Svelte, or plain `.astro` files), there's no markdown body to extract. You'd need a different strategy:

- Render to HTML and strip tags (lossy, messy)
- Maintain separate content files (duplicate effort)
- Use a headless CMS where content exists independently

For most blogs, content collections are the right choice anyway.

## Why Not Use a Library?

There are Astro integrations for llms.txt. They auto-generate from all pages at build time. Sounds convenient, but:

1. You get everything, including pages you might not want exposed
2. No per-post endpoints
3. No control over the output format
4. Another dependency to maintain

This implementation is ~150 lines of TypeScript. You control exactly what's included. You understand every line. For something this simple, the DIY approach wins.

## Bonus: An SVG Icon

The llms.txt logo is four rounded squares in a plus pattern. Here's a simple SVG version you can use in your navigation (a reconstruction - adjust the rect positions to taste):

```html
<svg viewBox="-4 -4 32 32" width="24" height="24" fill="currentColor">
  <rect x="8"  y="0"  width="8" height="8" rx="2" opacity="0.6" />
  <rect x="0"  y="8"  width="8" height="8" rx="2" opacity="0.7" />
  <rect x="16" y="8"  width="8" height="8" rx="2" opacity="0.8" />
  <rect x="8"  y="16" width="8" height="8" rx="2" opacity="1" />
</svg>
```

**Design notes:**

- **`viewBox="-4 -4 32 32"`** adds padding so the icon matches the visual weight of stroke-based icons like Lucide
- **`fill="currentColor"`** inherits from CSS, so it works with any color scheme
- **Varying opacity** (0.6, 0.7, 0.8, 1.0) gives depth without using multiple colors
- **`rx="2"`** rounds the corners to match the original logo style

For Astro, wrap it in a component so you can pass `size` as a prop and reuse it across your site.
## The Result

After deploying, you have:

- `/llms.txt` - Index listing all posts with descriptions
- `/llms-full.txt` - Complete content for RAG systems or full context
- `/llms/post-slug.txt` - Individual posts for focused queries

Agents fetch the index, pick what they need, and get clean markdown. No HTML parsing, no navigation noise, no wasted tokens. That's the point of the standard.

---

## Testing in the Age of AI Agents

URL: https://kumak.dev/testing-in-the-age-of-ai-agents/
Published: 2025-11-29
Category: philosophy

> When code changes at the speed of thought, tests become less about verification and more about defining what should remain stable.

AI agents don't just write code faster. They make rewriting trivial. Your codebase becomes fluid, reshaping itself as fast as you can describe what you want. But when everything is in flux, how do you know the features still work?

Something needs to hold the shape while everything inside it moves. That something is your tests. Not tests that document how the code works today, but tests that define what it must always do.

## The Contract Principle

The obvious purpose of tests is "catching bugs." But that's incomplete. Tests define what "correct" means. They're a contract: this is what the system must do. Everything else is negotiable.

Kent Beck captured this precisely: tests should be "sensitive to behaviour changes and insensitive to structure changes." A test that breaks when behaviour changes is valuable. A test that breaks when implementation changes, but behaviour stays the same, is actively punishing you for improving your code.
The difference is stark in practice (the `NavItem` props here are illustrative):

```typescript
// Testing implementation - breaks when you refactor
test('calls navItemVariants with correct params', () => {
  const spy = vi.spyOn(styles, 'navItemVariants');
  render(<NavItem to="/orders" label="Orders" />);
  expect(spy).toHaveBeenCalledWith({ active: false });
});

// Testing contract - survives any rewrite
test('renders as a link to the specified route', () => {
  render(<NavItem to="/orders" label="Orders" />);
  const link = screen.getByRole('link', { name: /orders/i });
  expect(link).toHaveAttribute('href', '/orders');
});
```

The first test knows the component uses a function called `navItemVariants`. Tomorrow, you might rename that function or eliminate it entirely. The test breaks. The component still works.

The second test knows only what matters: there's a link, it goes to `/orders`, it says "Orders." Rewrite the entire component. Swap the styling system. As long as users can click a link to their orders, the test passes.

```
$ npm test

❌ FAIL src/components/NavItem.test.tsx
  ✕ calls navItemVariants with correct params
  ✕ passes active prop to styling function
  ✕ renders with expected className

3 tests failed. The component works perfectly. 🙃
```

The tests haven't caught a bug. The behaviour is identical. You're just paying a tax on change.

## The Black Box

Treat every module like a black box. You know what goes in. You know what should come out. What happens inside is none of your tests' business.

This clarifies what to mock. External systems (APIs, databases, third-party services) exist outside your black box. Mock those. Your own modules exist inside. Don't mock those; let them run.

```typescript
// Mock external systems - they're outside your control
vi.mock('~/api/client', () => ({
  fetchUser: vi.fn().mockResolvedValue({ name: 'Test User' }),
}));

// Don't mock your own code - let it run
// ❌ vi.mock('~/components/ui/NavItem');
// ❌ vi.spyOn(myModule, 'internalHelper');
```

When you mock your own code, you're encoding the current implementation into your tests.
When the implementation changes, your mocks become lies. They describe a structure that no longer exists, and your tests pass while your code breaks.

A useful heuristic: before committing a test, imagine handing the module's specification to a developer who'd never seen your code. They implement it from scratch, differently. Would your tests pass? If yes, you've tested the contract. If no, go back and fix the test.

## The Circular Verification Problem

Here's where AI changes everything. Tests exist to verify that code is correct. If AI writes both the code and the tests, what verifies what?

The test was supposed to catch AI mistakes. But AI wrote the test. You've created a loop with no external reference point.

> AI writes code → AI writes tests → tests pass → "correct"?

Black box tests break this circularity because they're human-auditable. When a test says "there's a link that goes to `/orders`," you can read that assertion and verify it matches the requirement. You don't need to understand implementation details.

Implementation-coupled tests aren't auditable this way. To verify the test is correct, you'd need to understand the implementation it's coupled to. You're back to trusting AI about AI's work.

This suggests specific rules:

**Treat assertions as immutable.** AI can refactor how a test runs: the setup, the helpers, the structure. AI should not change what a test asserts without explicit human approval. The assertion is the contract.

```typescript
// AI can change this (setup)
const user = await setupTestUser({ role: 'admin' });

// AI should NOT change this (assertion) without approval
expect(user.canAccessDashboard()).toBe(true);
```

**Failing behaviour tests require human attention.** When a contract-level test fails, AI shouldn't auto-fix it. The failure is information. A human must decide: is this a real bug, or did requirements change?

**Separate creation from modification.** AI drafting new tests for new features is relatively safe.
AI modifying existing tests is riskier. New tests add coverage. Modified tests might silently remove it.

## What Not to Test

Simple, obvious code doesn't need tests. A component that renders a string as a heading doesn't need a test proving it renders a heading. A utility that concatenates paths doesn't need a test for every combination.

Test complex logic. Test edge cases. Test error handling. Test anything where a bug would be non-obvious or expensive to find later.

```typescript
// Congratulations, you've tested JavaScript
test('banana equals banana', () => {
  expect('🍌').toBe('🍌'); // ✅ PASS
});
```

Don't test that React renders React components. Don't test that TypeScript types are correct. Your test suite isn't a proof of correctness; it's a net that catches bugs that matter.

This restraint has a benefit: a smaller, focused test suite is easier to audit. When every test has a clear purpose, you can review what AI wrote and verify it matches intent.

## The Coverage Trap

Coverage measures execution, not intent. A test that executes a line of code isn't necessarily testing that the line does what it should.

Worse, coverage as a target incentivises exactly the wrong kind of tests. Need to hit 80%? Write tests that spy on every function, assert on every intermediate value. You'll hit your number. You'll also create a test suite that breaks whenever anyone improves the code.

```typescript
// Written for coverage, not for value
test('increases coverage', () => {
  const result = processOrder(mockOrder);
  expect(processOrder).toHaveBeenCalled(); // So what?
  expect(result).toBeDefined();            // Still nothing
});

// Written for behaviour
test('completed orders update inventory', () => {
  const order = createOrder({ items: [{ sku: 'ABC', quantity: 2 }] });
  processOrder(order);
  expect(getInventory('ABC')).toBe(initialStock - 2);
});
```

The real question isn't "how much code did my tests execute?" It's "would my tests catch a bug that matters?"
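A contrived illustration of why execution isn't verification: both checks below run every line of `applyDiscount` (100% coverage either way), but only the behaviour check notices the bug. Names and numbers are hypothetical:

```typescript
// Deliberately buggy: the discount is applied twice.
function applyDiscount(price: number, percent: number): number {
  const discounted = price - price * (percent / 100);
  return discounted - discounted * (percent / 100); // bug: second application
}

const result = applyDiscount(100, 10);

// "Coverage" check: executes the function, asserts almost nothing. Passes.
const coveragePasses = result !== undefined;

// Behaviour check: 10% off 100 should be 90. Fails, because result is 81.
const behaviourPasses = result === 90;

console.log({ result, coveragePasses, behaviourPasses });
// { result: 81, coveragePasses: true, behaviourPasses: false }
```

Same lines executed, opposite verdicts. Coverage counts the first kind of check exactly as highly as the second.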
## A Philosophy for Flux

Tests are how you know code is correct. When both code and tests are fluid, when AI can change either at will, you lose the ability to verify anything. The test that passed yesterday means nothing if it was rewritten to match today's code.

The philosophy is simple:

> Test what the code does, not how it does it.

Tests become specifications, not surveillance. They define what matters, not document what exists. And because they encode observable behaviour rather than internal structure, they remain human-auditable even when AI writes them.

When code is in constant flux, tests are your fixed point. They're stable not because change is expensive, but because they define what "correct" means. Without that fixed point, you have no way to know if your fluid code is flowing in the right direction.

---

## Self-Documenting CLI Design for LLMs

URL: https://kumak.dev/self-documenting-cli-design-for-llms/
Published: 2025-11-28
Category: philosophy

> Agents start fresh every session. Instead of dumping docs upfront, build tools they can query. One take on agent-friendly tooling.

I'm building a CLI tool for browser debugging. It lets AI agents control Chrome through the DevTools Protocol: capture screenshots, inspect network requests, execute JavaScript.

The Chrome DevTools Protocol has 53 domains and over 600 methods. That's a lot of capability and a lot of documentation. Here's the problem: how do I teach an agent what's possible without dumping thousands of tokens into context every session?

Documentation is a wall of text about things you don't need yet. Worse, it drifts. The tool ships a new version, someone forgets to update the docs, and now the agent is following instructions for a method that was renamed three months ago. The tool and its documentation are two artifacts pretending to be one.

When Claude gets stuck with CLI tools, it naturally reaches for `--help`. When that's not enough, it tries `command subcommand --help`.
The pattern is consistent: ask the tool, learn from the response, try again. If `--help` is the agent's natural discovery method, how far can you push it?

## Progressive Disclosure

Instead of documenting everything upfront, make every layer queryable. Watch the conversation unfold:

```shell
# Agent asks: "What can you do?"
bdg --help --json

# Agent asks: "What domains exist?"
bdg cdp --list

# Agent asks: "What can I do with Network?"
bdg cdp Network --list

# Agent asks: "How do I get cookies?"
bdg cdp Network.getCookies --describe

# Agent executes with confidence
bdg cdp Network.getCookies
```

Each answer reveals exactly what's needed for the next question. Five interactions, zero documentation. The tool taught itself.

When the agent doesn't know the exact method name, semantic search bridges the gap:

```shell
$ bdg cdp --search cookie

Found 14 methods matching "cookie":
  Network.getCookies     # Returns all browser cookies for the current URL
  Network.setCookie      # Sets a cookie with the given cookie data
  Network.deleteCookies  # Deletes browser cookies with matching name
  ...
```

The agent thinks "I need something with cookies" and the tool finds everything relevant. No guessing required.

## Errors That Teach

Actionable error messages have been a UX best practice for decades. What's different for agents is the stakes: humans can work around bad UX by searching Stack Overflow. Agents can't. They're stuck with what you give them, racing against a context window that's always shrinking.

And agents make mistakes constantly. They'll type `Network.getCokies` instead of `Network.getCookies`. They'll invent plausible-sounding methods that don't exist.

A typical error:

```shell
$ bdg cdp Network.getCokies
Error: Method not found
```

Now what? The agent has to guess, search, retry. Burn tokens.
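Suggesting near-matches doesn't require much machinery. A minimal edit-distance sketch (illustrative only, not bdg's actual implementation; the method list is truncated for the example):

```typescript
// Classic Levenshtein distance via dynamic programming.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

const known = ["Network.getCookies", "Network.setCookie", "Network.deleteCookies"];

// Rank known methods by distance; keep only plausible corrections.
function suggest(input: string, maxDistance = 3): string[] {
  return known
    .map((method) => ({ method, d: editDistance(input, method) }))
    .filter(({ d }) => d <= maxDistance)
    .sort((x, y) => x.d - y.d)
    .map(({ method }) => method);
}

console.log(suggest("Network.getCokies")); // nearest methods first
```

Ranking the known method list by distance to the bad input is enough to catch most typos, and the cost is trivial next to a failed round trip.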
Teaching errors provide the path forward:

```shell
$ bdg cdp Network.getCokies
Error: Method 'Network.getCokies' not found

Did you mean:
  - Network.getCookies
  - Network.setCookies
  - Network.setCookie
```

The correction arrives in the same response as the error. No round trip. The agent adapts immediately.

The fuzzy matching goes beyond typos. Try `Networking.getCookies` with the wrong domain name, and it still suggests `Network.getCookies`. The tool understands what you meant, not just what you typed.

Even empty results guide forward:

```shell
$ bdg dom query "article h2"
No nodes found matching "article h2"

Suggestions:
  Verify: bdg dom eval "document.querySelector('article h2')"
  List:   bdg dom query "*"
```

And success states show next steps:

```shell
$ bdg dom query "h1, h2, h3"
Found 5 nodes:
  [0] <h2> Recent Posts
  [1] <h2> Testing in the Age of AI Agents
  ...

Next steps:
  Get HTML: bdg dom get 0
  Details:  bdg details dom 0
```

Every interaction answers "what now?" Errors suggest fixes. Empty results suggest alternatives. Success shows what to do with the data.

## Semantic Exit Codes

Most tools return 1 for any error. Not helpful. Semantic exit codes create ranges with meaning:

- **80-89**: User errors. Bad input, fix it before retrying.
- **100-109**: External errors. API timeout, retry with backoff.

The agent can branch its logic without parsing error messages. Message, suggestion, exit code: three layers of guidance stacked together.

## The Result

I tested this with an agent starting from zero knowledge. No prior context, no documentation provided. Just the tool.

Five commands later, it was executing CDP methods successfully. It discovered the tool's structure, explored the domains, found the method it needed, understood the parameters, and executed.

When I introduced typos deliberately, the suggestions caught them. When commands failed, the exit codes pointed toward solutions. The agent recovered without external help.

The context cost? Roughly 500 tokens for discovery, versus thousands for a documentation dump. And those 500 tokens bought understanding, not just information.

## Design for Dialogue

External documentation will always drift from reality. The tool itself never lies about its own capabilities.

Tools designed for agents aren't dumbed down. They're more explicit. They expose their structure. They teach through interaction rather than requiring upfront reading.

Design for dialogue, not documentation.

---

## CLI vs MCP on Chrome DevTools Protocol

URL: https://kumak.dev/cli-vs-mcp-benchmarking-browser-automation/
Published: 2025-11-23
Category: opinion

> I ran benchmarks comparing CLI tools against MCP servers for browser automation. 13x more token efficient with CLI. Here's what I found.
Anthropic published an observation about [MCP and code execution](https://www.anthropic.com/engineering/code-execution-with-mcp): executable code in the filesystem might be more efficient for AI agents than protocol servers. Isn't that what CLI tools already are?

I tested this by comparing two approaches to browser automation:

- **CLI**: [bdg](https://github.com/szymdzum/browser-debugger-cli), a browser debugger CLI I built
- **MCP**: [Chrome DevTools MCP](https://github.com/ChromeDevTools/chrome-devtools-mcp), the official Chrome DevTools protocol server

Both interact with the Chrome DevTools Protocol. Same underlying capabilities. The question: does the interface matter?

## Methodology

Fresh Claude instance (Sonnet 4.5) with zero prior knowledge of either tool. Identical tasks across three websites: Hacker News (navigation, extraction), CodePen (inspection, screenshots), Amazon (product data, anti-bot stress test). No human guidance.

**Full methodology**: [BENCHMARK_PROMPT.md](https://github.com/szymdzum/browser-debugger-cli/blob/main/docs/benchmarks/BENCHMARK_PROMPT.md)

## Results

### Token Efficiency: 13x Difference

| Tool           | Total Tokens | Per Test Average |
| -------------- | ------------ | ---------------- |
| **bdg (CLI)**  | 6,500        | ~2,200           |
| **Chrome MCP** | 85,500       | ~28,500          |

That's roughly a **13x difference** in bdg's favour.

The gap comes from how each tool returns information:

**MCP**: Full accessibility snapshots. Every page state returns a complete accessibility tree. Amazon product page alone: **52,000 tokens** in one snapshot.

**CLI**: Targeted queries. `bdg dom query ".athing"` returns 30 Hacker News stories in 1,200 tokens.

### Command Count

| Test        | bdg Commands | MCP Calls |
| ----------- | ------------ | --------- |
| Hacker News | 11           | 8         |
| CodePen     | 6            | 5         |
| Amazon      | 4            | 3         |

MCP requires slightly fewer calls. But when one approach uses 13x more tokens per call, command count becomes less relevant.
### Discovery: Zero-Knowledge Learning

The most interesting result was watching how the agent learned each tool.

**bdg discovery path** (5 commands to first successful execution):

```bash
bdg --help --json                      # Agent learns: 10 commands, exit codes, task mappings
bdg cdp --list                         # Agent learns: 53 domains available
bdg cdp Network --list                 # Agent learns: 39 methods in Network domain
bdg cdp --search cookie
bdg cdp Network.getCookies --describe  # Agent discovers parameters, return types, examples
bdg cdp Network.getCookies             # Successful execution
```

Zero external documentation. The tool taught itself through introspection.

**MCP discovery**: Requires understanding the MCP protocol, parsing 10k+ token accessibility trees to find element UIDs, and using UID-based selection from snapshots.

### Element Selection

**bdg**: Standard CSS selectors

```bash
bdg dom query ".athing"        # Hacker News stories
bdg dom query "#productTitle"  # Amazon product
bdg dom click "button[type=submit]"
```

**MCP**: UID-based from accessibility tree

```javascript
take_snapshot({});       // Returns 10k+ tokens
click({ uid: '1_28' });  // Must find UID in snapshot first
```

CSS selectors are familiar; UID-based selection is more robust for dynamic content. Trade-offs exist.

## Why the Gap Exists

**Return payload size**: MCP's accessibility snapshots include everything. CLI queries return only what you ask for. This is the primary driver of the 13x difference.

**Error handling**: CLI failures return structured errors with exit codes and suggestions.
The agent can self-correct:

```bash
$ bdg dom click ".missing-button"
Error: Element not found: .missing-button
Exit code: 81 (user error)
Suggestions:
  - Verify selector: bdg dom query ".missing-button"
  - List all buttons: bdg dom query "button"
```

**Composability**: CLI tools integrate with Unix:

```bash
bdg peek --network | jq '.[] | select(.status >= 400)'
bdg dom query "button" | jq '.[0].nodeId' | xargs bdg dom click
```

If the tool doesn't provide exactly what you need, compose it with `jq`, `grep`, or `awk`. MCP servers require protocol extensions.

**Protocol access**: bdg exposes all 644 CDP methods across 53 domains. MCP servers expose curated subsets. Need `Profiler.startPreciseCoverage`? It's there via `bdg cdp Profiler.startPreciseCoverage`.

## Limitations

- **Small sample**: 3 websites
- **Single model**: Claude Sonnet 4.5 only
- **Specific scenarios**: information extraction workflows
- **Bot detection**: both tools faced blocks on some sites

The token efficiency advantage should apply equally to debugging workflows (console streaming, network inspection, profiling), but I didn't benchmark those specifically.

## Takeaway

For browser automation with AI agents, CLI tools proved 13x more token-efficient than MCP servers accessing the same underlying protocol. The difference comes from targeted queries vs. full snapshots, plus the ability to compose with Unix tools and access the complete CDP surface.

This isn't definitive. It's one data point. Your needs may differ.

**Full results**: [BENCHMARK_RESULTS_2025-11-23.md](https://github.com/szymdzum/browser-debugger-cli/blob/main/docs/benchmarks/BENCHMARK_RESULTS_2025-11-23.md)

---

## How My Agent Learned GitLab

URL: https://kumak.dev/how-my-agent-learned-gitlab/
Published: 2025-11-17
Category: tutorial

> Teaching an agent to use CLI tools isn't about writing perfect documentation. It's about creating a feedback loop where the tool teaches, the agent learns, and reflection builds institutional knowledge.
I work with a monorepo that has over 80 CI/CD jobs across 12 stages. When pipelines fail, I need to trace through parent pipelines, child pipelines, failed jobs, and error logs.

There's an MCP server for GitLab. I tried it once, then installed `glab` and wrote a basic [skill file](https://gist.github.com/szymdzum/304645336c57c53d59a6b7e4ba00a7a6) with command examples.

What's interesting isn't the skill itself. It's how it developed through three investigation sessions.

## Session One: Real-Time Self-Correction

"Investigate pipeline 2961721" was my first request.

Claude ran a command. Got 20 jobs back. The pipeline had 80+. I watched Claude notice the discrepancy, run `glab api --help`, spot the `--paginate` flag, and try again. This time: all the jobs.

Then it pulled logs with `glab ci trace <job-id>`. The logs looked clean. No errors visible. But the job had definitely failed.

I didn't explain what was wrong. I asked: "The job failed, but you're not seeing errors. What might be happening?"

Claude reasoned through it: "Errors might be going to stderr instead of stdout." Then it checked `glab ci trace --help`, found nothing about stderr handling, and figured out the solution: `glab ci trace <job-id> 2>&1`. Reran it. Errors appeared.

**After the session**, I asked: "What went wrong? What did you learn?"

Claude listed the issues: forgot to paginate (only saw 20 of 80+ jobs), missed stderr output, didn't know about child pipelines. We talked through each one, then updated the skill file:

```markdown
## Critical Best Practices

1. **Always use --paginate** for job queries
2. **Always capture stderr** with `2>&1` when getting logs
3. **Always check for child pipelines** via bridges API
4. **Limit log output** — use `tail -100` or `head -50`
```

Twenty minutes of reflection. Four critical lessons documented.

## Session Two: Faster, Smarter

"Check pipeline 2965483."
This time, Claude used `--paginate` from the start, captured stderr when pulling logs, and checked for child pipelines via the bridges API. Found a failed child pipeline, got its jobs, identified the error. Start to finish: five minutes.

But something new happened. All 15 Image build jobs failed. Claude started pulling logs for each one. I watched it fetch the first three — all identical errors. The base Docker image was missing from ECR.

"You just pulled three identical error messages," I pointed out. "What does that tell you?"

Claude recognised the pattern: "When multiple jobs of the same type fail, they likely have the same error. I should check one representative job instead of all 15."

Added to the skill file:

```markdown
## Pattern: Multiple Failed Jobs

When many jobs fail (e.g., all Image builds), check one representative job first.

FIRST_FAILED=$(glab api "projects/2558/pipelines/<pipeline-id>/jobs" --paginate | \
  jq -r '.[] | select(.status == "failed") | .id' | head -1)
glab ci trace $FIRST_FAILED 2>&1 | tail -100
```

## Session Three: Institutional Knowledge

Third investigation. Checkout server build timed out. Claude saw the error, started digging.

"Wait," I said. "Before you investigate, check the duration."

Claude checked: 44 minutes. "That's within normal range for checkout server builds," I told it. "This is a known issue, not an actual failure."

Added to the skill file:

```markdown
## Common Error Patterns

Build Timeout:
ERROR: Job failed: execution took longer than