
CLI vs MCP on Chrome DevTools Protocol


Anthropic published an interesting observation about the Model Context Protocol and code execution. They noted that executable code in the filesystem might be more efficient for AI agents than protocol servers. Isn’t that essentially what CLI tools already are?

I decided to test this hypothesis by comparing two approaches to browser automation with AI agents:

  • CLI approach: bdg - A browser debugger CLI I built
  • MCP approach: Chrome DevTools MCP - The official Chrome DevTools protocol server

Both tools interact with the Chrome DevTools Protocol, so they have access to the same underlying capabilities. The question is: does the interface matter?

Methodology

I used a fresh Claude instance (Sonnet 4.5) with zero prior knowledge of either tool. The agent received identical tasks across three real websites:

  1. Hacker News - Navigate, count stories, extract comments
  2. CodePen - Inspect trending pens, capture screenshots
  3. Amazon - Extract product information (anti-bot stress test)

The goal was to see how agents discover and use each tool naturally, without human guidance.

Full methodology: BENCHMARK_PROMPT.md

Results

Token Efficiency: 13x Difference

This was the most striking finding:

Tool          Total Tokens         Per-Test Average
bdg (CLI)     6,500                ~2,200
Chrome MCP    85,500               ~28,500
Difference    13x more efficient   -

The gap comes from how each tool returns information:

MCP’s approach: Full accessibility snapshots

  • Every page state = complete accessibility tree
  • Amazon product page alone: 52,000 tokens in one snapshot
  • Includes every element, nested structure, full context

CLI’s approach: Targeted queries

  • CSS selectors return only matching elements
  • bdg dom query ".athing" → 1,200 tokens (30 stories)
  • Progressive disclosure - get what you need, when you need it (sketched below)
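
To make that progressive disclosure concrete, here is a hedged sketch of narrowing a Hacker News extraction step by step. The second selector and the .text field name are assumptions for illustration, not confirmed bdg output:

# Step 1: count matches without dumping them (jq over JSON output, as in the examples below)
bdg dom query ".athing" | jq 'length'

# Step 2: pull only the story titles (the .text field name is an assumption)
bdg dom query ".athing .titleline a" | jq '.[].text'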

Command Count: Roughly Equivalent

Test          bdg Commands   MCP Calls   Winner
Hacker News   11             8           MCP
CodePen       6              5           MCP
Amazon        4              3           MCP
Average       7              ~5          MCP

MCP requires slightly fewer calls for simple tasks. But when one approach consumes roughly 13x more tokens overall, command count becomes a secondary concern.

Discovery: Zero-Knowledge Learning

One of the most interesting aspects was watching how the agent learned each tool.

bdg discovery path (5 steps, 6 commands):

# 1. What is this tool?
bdg --help --json

# 2. What can CDP do?
bdg cdp --list
# Result: 53 domains available

# 3. What Network methods exist?
bdg cdp Network --list
# Result: 39 methods

# 4. How do I get cookies?
bdg cdp --search cookie
bdg cdp Network.getCookies --describe

# 5. Execute
bdg cdp Network.getCookies

The agent went from zero knowledge to successful execution without external documentation. It taught itself through the tool’s introspection.

MCP discovery path:

  • Requires understanding of MCP protocol
  • Uses UID-based element selection from snapshots
  • Must parse 10k+ token accessibility trees to find elements

Element Selection

bdg: Standard CSS selectors

bdg dom query ".athing"           # Hacker News stories
bdg dom query "#productTitle"     # Amazon product
bdg dom click "button[type=submit]"

MCP: UID-based from accessibility tree

take_snapshot({});
// Returns 10k tokens
click({ uid: "1_28" });
// Must find UID in snapshot first

CSS selectors are more familiar to developers, but UID-based selection is more robust for dynamic content. Trade-offs exist on both sides.
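
One way to soften the dynamic-content weakness of CSS selectors is a small polling loop. A minimal sketch, assuming bdg dom query exits non-zero when nothing matches (consistent with the exit-code behavior shown in the next section); the selector is illustrative:

# Poll up to 5 times for a late-appearing element, then click it
for attempt in 1 2 3 4 5; do
  if bdg dom query ".load-more" > /dev/null 2>&1; then
    bdg dom click ".load-more"
    break
  fi
  sleep 1
done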

Why CLI Tools Enable Self-Correction

There’s a fundamental difference in how CLI tools and protocol servers handle limitations:

CLI tools expose their constraints explicitly. When a command fails, you get structured errors with exit codes, suggestions, and full context. This enables agents to self-correct:

$ bdg dom click ".missing-button"
Error: Element not found: .missing-button
Exit code: 81 (user error)

Suggestions:
  - Verify selector: bdg dom query ".missing-button"
  - List all buttons: bdg dom query "button"
  - Wait for element: sleep 2 && bdg dom click ".missing-button"

The agent learns what went wrong and how to fix it. Error recovery becomes part of the workflow.
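
In script form, that recovery loop might look like the following sketch; this is agent-style glue code, not built-in bdg behavior:

# Attempt the click; on failure, inspect the page and retry with a better selector
if ! bdg dom click ".missing-button"; then
  bdg dom query "button"                 # see which buttons actually exist
  sleep 2                                # give dynamic content time to render
  bdg dom click "button[type=submit]"    # retry with a selector found above
fi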

Protocol servers hide implementation gaps. If an MCP server doesn’t expose a specific CDP method, there’s no way to access it. You’re limited to the 28 curated tools the server provides. Need something from the Profiler domain? Security domain? WebAuthn domain? You’re stuck until someone updates the server.

Composability means extensibility. CLI tools integrate with the Unix ecosystem:

# Filter requests by status code
bdg peek --network | jq '.[] | select(.status >= 400)'

# Chain commands for workflows
bdg dom query "button" | jq '.[0].nodeId' | xargs bdg dom click

# Combine with other tools
bdg network getCookies | grep "session" | cut -d: -f2

If bdg doesn’t provide exactly what you need, you can compose it with jq, grep, awk, or any other Unix tool. MCP servers require protocol extensions and server updates.

Full protocol access matters. bdg exposes all 644 CDP methods across 53 domains. If you need Profiler.startPreciseCoverage or Security.setIgnoreCertificateErrors, it’s already there via bdg cdp Profiler.startPreciseCoverage. No waiting for server maintainers to add support.
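
For example, a JavaScript coverage session can be driven entirely through raw CDP calls. A hedged sketch using real CDP method names; the jq path into the response is an assumption about how bdg wraps results:

# Enable the profiler and start precise coverage
bdg cdp Profiler.enable
bdg cdp Profiler.startPreciseCoverage

# ... interact with the page under test ...

# Collect coverage, then tear down (the .result path is assumed)
bdg cdp Profiler.takePreciseCoverage | jq '.result | length'
bdg cdp Profiler.stopPreciseCoverage
bdg cdp Profiler.disable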

This isn’t just about efficiency. It’s about not being artificially limited by someone else’s API design decisions.

What This Means

Token Efficiency Compounds

For a single task, 13x might seem manageable. But consider:

  • Debugging session: 20+ page states → 200k vs 15k tokens
  • Multi-step workflow: Navigate, fill forms, verify → tokens add up fast
  • Context window limits: More tokens = less room for reasoning

At scale, this efficiency gap becomes significant.

Self-Documentation Enables Autonomy

The most interesting finding wasn’t the numbers - it was watching the agent learn through introspection:

$ bdg --help --json
# Agent learns: 10 commands, exit codes, task mappings

$ bdg cdp --list
# Agent learns: 53 domains

$ bdg cdp --search cookie
# Agent discovers: 14 cookie-related methods

$ bdg cdp Network.getCookies --describe
# Agent learns: parameters, return types, examples

No external documentation needed. The tool IS the documentation.
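
That machine-readable help composes like any other JSON. A small sketch; the .commands field name is an assumption about the output shape:

# List available commands straight from the JSON help output
bdg --help --json | jq '.commands[].name'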

Unix Composability Matters

CLI tools compose naturally with Unix tools:

# Filter network requests
bdg peek --last 20 | jq '.[] | select(.status >= 400)'

# Count specific elements
bdg dom query ".error" | jq 'length'

# Chain commands
bdg dom query "button" && bdg dom click "button:first-child"

This flexibility is harder to replicate in protocol-based tools.

Limitations

This benchmark has clear constraints:

  1. Small sample size - Only 3 websites tested
  2. Single model - Only Claude Sonnet 4.5
  3. Specific scenarios - Information extraction workflows
  4. Bot detection - Both tools faced blocks on some sites

On Debugging Workflows

The most common criticism is that this focused on “information extraction” rather than debugging workflows. However, bdg provides comprehensive debugging abstractions (combined in the sketch after this list):

  • Console debugging: bdg console --follow for real-time error streaming
  • Network debugging: bdg peek --network, bdg tail, bdg network headers
  • Performance profiling: Full CDP access via bdg cdp Profiler --list
  • HAR export: bdg network har for deep network analysis
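
Put together, a short debugging pass might look like this sketch, which combines only the commands listed above:

# Stream console errors in the background while exercising the page
bdg console --follow &
FOLLOW_PID=$!

bdg dom click "button[type=submit]"                       # trigger the action under test
bdg peek --network | jq '.[] | select(.status >= 400)'    # surface failed requests

kill "$FOLLOW_PID"                                        # stop the console stream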

The token efficiency advantage applies equally to debugging workflows. A debugging session with 20 page states would consume 200k+ tokens via MCP snapshots (20 snapshots at 10k+ tokens each) versus roughly 15k via targeted bdg queries (20 queries at well under 1k tokens each).

Both tools access the same Chrome DevTools Protocol - the difference is the interface, not the capabilities.

Takeaways

This isn’t a definitive “CLI beats MCP” statement. It’s one data point suggesting:

Token Efficiency: With MCP, you pay upfront for every tool definition and capability declaration, whether you use them or not. CLI tools like glab, jq, and grep were already in the model’s training data. A skill document showing usage patterns runs ~3k tokens; MCP server definitions alone can run 5-10k before you invoke a single tool.

Composability: Unix philosophy wins here. CLI tools pipe together, each doing one thing well, and you chain them for complex workflows. MCP servers are monolithic endpoints: if a server doesn’t expose your exact query, you’re stuck. With CLI, you can grep, pipe to files, and combine tools. The model already knows these patterns.

Debuggability: CLI errors are transparent. You see exactly what failed and why. MCP errors hide behind protocol layers and server logs you can’t access. The model can identify CLI errors, understand them, and adapt accordingly.

Real-Time Evolution: I can update my skill document while the agent uses it, adding patterns and refining examples. With MCP, you’re locked to whatever the server exposes. Want new functionality? Wait for the maintainer to add it, redeploy, hope nothing breaks. With CLI, I just update the markdown.

For my use case (browser automation with AI agents), CLI tools with self-documentation proved more efficient than MCP servers. The ability to compose with Unix tools and access the full CDP surface without waiting for server updates is significant.

Your mileage may vary depending on your needs.

Resources