Preface
In this document I will not rehash the well-known material you can read in the official Claude Code documentation. Instead, I want to highlight the practices that I use myself and that my fellow engineers use.
AI Scaffold
All the approaches, coding agents, and other tools presented in this document are part of an AI Scaffold on top of LLM models that score well on agent benchmarks such as SWE-bench, HumanEval, and others. At the time of writing, the leader is Anthropic Opus 4.6 (High reasoning).
Most of the functionality of agent systems such as Claude Code, Codex, and Gemini can be reproduced outside these systems. For example, the Plan mode in Claude Code — where the agent writes a plan, then breaks it into tasks and starts executing after clearing the context — is not the only correct way to implement such functionality.
As the author, I want to convey that it doesn’t matter which coding agent you use — what matters is your own mindset and ability to build a convenient workflow for your project’s tasks. It’s also worth noting that the same AI Scaffold can produce drastically different results depending on the LLM the agent uses.
The following sections will cover the methodology of working with one specific tool — Claude Code, which uses the best available Anthropic models. The guide can be extrapolated to other agents, but keep in mind that Claude Code is software designed to work with a strict format, file structure, and configurations.
Using agents greatly reduces the need for a rigid configuration structure, because agents can search for configs on your machine. However, you can simplify any coding agent’s job by sticking to the strict structure native to that agent. This simply saves tokens and boosts performance by conserving space in the session’s context window.
Configuration Architecture
Claude Code allows you to configure settings at the global level, at the project level, and in every subdirectory of the project. When a session starts, Claude Code scans directories from the current one up to the root, looking for additional configurations.
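For instance, given a hypothetical layout like the one below, a session started in `services/api` picks up the root-level CLAUDE.md and settings as well as the subdirectory's local overrides:

```
repo/
├── CLAUDE.md
├── .claude/
│   └── settings.json
└── services/
    └── api/
        └── .claude/
            └── settings.local.json
```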
Claude Code Configuration Files
| File | In Repo | Purpose | Scope |
|---|---|---|---|
| CLAUDE.md | Yes | Project knowledge, architecture, commands, conventions | Every session |
| .claude/settings.json | Yes | Shared permissions, hooks, plugin config | Whole team |
| .claude/settings.local.json | No | Personal permissions, machine-specific paths | Your machine only |
| .claude/skills/*/SKILL.md | Yes | Reusable slash-command workflows | On demand |
| .claude/agents/*.md | Yes | Specialized sub-agent definitions | Task tool |
| .mcp.json | Yes* | MCP server connections (Figma, Grafana, etc.) | Integrations |
| <subdir>/.claude/settings.local.json | No | Directory overrides (enable MCP, extra permissions) | Subdirectory |
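As a sketch of such a directory override — the server name and command pattern here are hypothetical — a `<subdir>/.claude/settings.local.json` might enable one MCP server and a few extra commands only inside that directory:

```json
{
  "enabledMcpjsonServers": ["grafana"],
  "permissions": {
    "allow": ["Bash(kubectl get:*)"]
  }
}
```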
Key Principle: Separation of Concerns
Team configuration (settings.json) defines a safe baseline shared by everyone — standard linting commands, git operations, deny rules for destructive actions.
Personal configuration (settings.local.json) extends it with machine-specific paths and additional permissions you trust.
Project knowledge (CLAUDE.md) tells Claude what the project is. Settings tell it what it’s allowed to do.
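A minimal sketch of this split, with illustrative (not prescriptive) command patterns — the shared `.claude/settings.json` allows routine commands and denies destructive ones:

```json
{
  "permissions": {
    "allow": ["Bash(git status:*)", "Bash(npm run lint:*)"],
    "deny": ["Bash(rm -rf:*)", "Read(.env)"]
  }
}
```

Your `.claude/settings.local.json` would then add entries such as `"Bash(docker compose up:*)"` for your machine without touching the team baseline.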
Why CLAUDE.md and not AGENTS.md
Any text documents can be used to define workflows and instructions for LLM agents. However, if you use AGENTS.md, Claude Code will first try to read CLAUDE.md, fail to find it, and then go looking for AGENTS.md. This wastes extra tokens and doesn’t guarantee the file will actually be found.
When using multiple agent systems simultaneously, use native configurations for each system. For Claude Code — CLAUDE.md.
What Goes in CLAUDE.md
The quality of generated code is influenced most by the content of CLAUDE.md, because it is always loaded into the context at the beginning of each session. CLAUDE.md can be auto-generated with the /init command, but I recommend adding a few things to it.
Define the Structure
Here is one of the templates I use for almost all my projects. As a rule, if you're unhappy with the quality in some specific area, go and add the relevant details to that section manually.
Project Overview ← 2-3 sentences. What does this project DO?
Repository Structure ← Directory tree with one-line descriptions
Build & Development Commands ← Exact commands to build, run, test, lint
Architecture ← Request flow, integration points, namespaces
Per-Service Sections ← Deep-dive for each service
Testing ← How to run tests, what frameworks are used
Linting ← Tools and configs (ruff rules, eslint)
Key Gotchas ← Things that WILL trip Claude up without warning
Code Style ← Language-specific conventions
Verification Before Committing ← The exact commands to run before any commit
Git Workflow ← How your team pushes code (Gerrit, GitHub, etc.)
Commit Convention ← Format, rules, examples
Example: Commit Convention
## Commit Convention
**No type prefixes** (`feat:`, `fix:`, etc.) — write human-readable summaries.
### Format
<short descriptive summary, max 72 chars>
<body: at least one paragraph explaining what changed and why>
### Examples
Resolve website crash that blocked user access
The website was crashing during initialization due to a null reference
in the auth module. Root cause was a missing null check in
SessionProvider when SSO token is expired.
MCPs
Why MCP
Using MCP significantly reduces noise in your agent’s context, helping it lose track of its goals less often and saving tokens. For example, a Kubernetes MCP greatly simplifies the agent’s work thanks to ready-made tools that accept and return parameters in the right format.
My MCPs
Which MCP servers you should use depends on your tasks. Here are a few that I use every day:
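Since the useful set depends entirely on your stack, here is only the shape of an `.mcp.json` with two hypothetical entries — a stdio server launched via npx and a remote HTTP server (names, package, and URL are placeholders, not recommendations):

```json
{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "mcp-server-kubernetes"]
    },
    "grafana": {
      "type": "http",
      "url": "https://grafana.example.com/mcp"
    }
  }
}
```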
Don’t overload the context
It’s always tempting to load up on tools — download every useful MCP. But there is a fairly hard limit you shouldn’t exceed.
Each MCP server loads tool information into the context. When importing many MCP servers (more than 4–5), the context window portion for MCP tools can exceed 15k tokens (15% of the Opus 4.6 context window), which is unacceptably large for complex tasks.
Security
Security is fairly straightforward. Let’s be honest: you probably won’t be able to protect everything, but you can try to avoid the main risks.
In settings.json you can configure Permissions to allow or deny specific commands and actions. You should exclude .env files and other files with private keys from being readable.
You should also be cautious about visiting unknown websites, because any site can contain a Prompt Injection that can negatively affect your agent’s work and lead to sensitive data being compromised.
Permissions configuration example
{
"permissions": {
"allow": [
"Bash(command:*)",
"WebFetch(domain:example.com)",
"mcp__server__tool"
],
"deny": [
"Read(.env)",
"Bash(rm -rf:*)"
]
}
}
Skills
Agent Teams
In February 2026, Anthropic released an update that allows you to launch entire teams of agents that can communicate with each other and execute their tasks within a predefined role.
You can read the technical guide in Anthropic’s documentation, and you can enable the feature by adding the following code to settings.json:
{
"env": {
"CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
}
}
I most often use agent teams for discussing complex architecture and making technical decisions. I have my own skill called /expert-panel. It describes several agent roles, what they do, and how they approach problem-solving.
If you give them a complex task, they will argue, debate, negotiate, and most often reach an agreement or ask for my opinion, laying out their thoughts in detail (each from their own perspective). You can try the skill by adapting it for your project and giving it to Claude Code.
---
name: expert-panel
description: "Use this skill when the user wants to brainstorm, evaluate, or design a plan using a multi-expert panel approach. Triggers include: requests for architecture review, infrastructure planning, security assessment, system design, solution comparison, risk analysis, or any task where the user says 'brainstorm', 'expert panel', 'review my plan', 'design a solution', 'evaluate options', 'pros and cons', or 'what could go wrong'. Also use when the user provides a technical goal and wants a structured, multi-perspective analysis before implementation."
allowed-tools: Read, Glob, Grep, WebSearch, WebFetch, Task, AskUserQuestion
---

# Expert Panel — Structured Multi-Perspective Brainstorming for ML Pipelines

## Overview

This skill convenes a virtual panel of senior domain experts to analyze a problem, brainstorm solutions, debate tradeoffs, and produce a structured plan. It is designed for **planning and design phases** — the output is a decision document, not executed code or commands. The default panel is tuned for the ML Pipelines platform (Next.js frontend, FastAPI backend, Kafka workers, Kubernetes infrastructure), but the user can request any panel composition for any domain.

## When to Use

- User wants to **design** a system, architecture, or deployment plan
- User wants to **evaluate** an existing plan, config, or approach
- User wants to **compare** multiple solutions with structured tradeoffs
- User wants to **identify risks** before implementing something
- User asks "what should I consider", "what could go wrong", "review this plan"
- User wants a brainstorming session with multiple viewpoints

## When NOT to Use

- User wants a quick factual answer (just answer it)
- User wants to execute commands directly (that's operations, not planning)
- User wants a single-perspective code review (use standard review practices)

---

## Workflow

Every panel session follows this pipeline:

```
SCOPE -> PANEL -> ANALYZE -> DEBATE -> CONVERGE -> DELIVER
```

### Step 1 — Scope the Problem

Before convening the panel, clarify:

1. **What** — Restate the user's goal in one sentence.
2. **Constraints** — Timeline, existing infrastructure state, team skill level, K8s cluster resources, compliance requirements.
3. **Blast radius** — What breaks if this goes wrong? Classify as:
   - `contained` — affects one service or component (e.g., a single Next.js page)
   - `service-level` — affects a full service (e.g., ml-orchestrator goes down)
   - `cross-service` — affects multiple services or the Kafka pipeline end-to-end
   - `platform-critical` — data loss, GPU workload corruption, or full pipeline outage
4. **Decision type** — Is the user choosing between options, designing from scratch, or stress-testing an existing plan?

If anything is ambiguous, ask the user **one focused question** before proceeding. Do not ask a checklist of questions — infer what you can and ask only what you can't.

### Step 2 — Assemble the Panel

Select the relevant experts for this task. Not every task needs every expert. Choose based on what the problem actually requires.

Use the default panel (see Panel Roster below) unless:
- The user specifies a custom domain (e.g., "use a data engineering panel")
- The task clearly falls outside the ML Pipelines domain

When using a custom domain, construct analogous roles: a stakeholder advocate, a system designer, a risk assessor, a builder, an operator, a researcher, and a tester.

### Step 3 — Individual Analysis

Each selected expert provides a brief, opinionated take covering:
- Their **assessment** of the situation from their role's perspective
- **Risks** they see that others might miss
- **Recommendations** they'd push for
- A specific **question** they'd want answered before committing

Format each expert's input as:

```
### [Role Name]
**Assessment:** [2-3 sentences]
**Risks:** [Key concerns from this perspective]
**Recommendation:** [What this expert advocates]
**Open question:** [What they'd want to verify]
```

Keep each expert's section tight — 4-8 lines max. The value is in the diversity of perspectives, not verbose analysis.

### Step 4 — Debate & Conflicts

After individual analysis, identify where experts **disagree**. This is the most valuable part. Surface conflicts explicitly:

```
### Panel Conflicts
| Topic | Position A | Position B | Resolution |
|-------|-----------|-----------|------------|
| [e.g., caching strategy] | Architect: Redis cluster (HA) | Platform: single Redis (simplicity) | [which priority wins and why] |
```

Resolve conflicts using the Priority Ladder:
1. **Safety** — No data loss, no GPU workload corruption, no unrecoverable pipeline states
2. **Correctness** — It must actually work end-to-end (browser to ComfyUI and back)
3. **Security** — Least privilege, SSO/JWT integrity, no secret leaks
4. **Reliability** — Proven over novel, Kafka message durability, MinIO consistency
5. **Simplicity** — Fewer moving parts wins, don't add K8s complexity without justification
6. **Cost/Effort** — GPU time is expensive, cluster resources are finite
7. **Elegance** — Nice but never decisive

If an expert is overruled, note their dissent. Never silently suppress a valid concern.

### Step 5 — Converge on a Plan

Synthesize the panel's input into a structured action plan:

```
### Recommended Plan
**Approach:** [One-paragraph summary of the chosen direction]

**Key decisions:**
1. [Decision] — [Rationale, citing which expert(s) drove it]
2. [Decision] — [Rationale]

**Mitigations (from Security/QA):**
- [Risk] -> [Mitigation]

**Open items requiring user input:**
- [Item] — [Why the panel can't resolve this without more info]

**Suggested next steps:**
1. [Concrete next action]
2. [Concrete next action]
```

### Step 6 — Deliver the Output

Produce the final deliverable based on what the user needs:

| User asked for... | Deliver |
|---|---|
| "brainstorm" / "think through" | Full panel analysis (Steps 3-5) inline in chat |
| "write a plan" / "design doc" | Markdown file saved to project root or specified path |
| "compare options" | Tradeoff matrix (see Templates below) |
| "review my plan" | Risk assessment + improvement recommendations |
| "what could go wrong" | Failure mode analysis from QA + Security |

---

## Panel Roster (Default: ML Pipelines Platform)

| Role | Mandate | Intentional Bias | Signature Question |
|---|---|---|---|
| **Product Owner** | Advocates for user experience, generation speed, workflow reliability, and proportional response. Rejects over-engineering. | Stability & shipping | *"Does this actually improve the generation workflow for users?"* |
| **Platform Architect** | Designs coherent solutions across frontend, backend, workers, and K8s. Evaluates systemic impact, service boundaries, and Kafka message flow. | Long-term health & clean boundaries | *"How does this fit the request flow end-to-end?"* |
| **Security Analyst** | Identifies attack surfaces in SSO/JWT flow, API routes, MinIO access, K8s RBAC, and Kafka topics. Assumes breach. | Zero trust & least privilege | *"How can this be exploited or leak data?"* |
| **Platform Engineer** | Handles Terraform, K8s manifests, Helm charts, Kafka/Strimzi, GPU scheduling, and CI/CD. Wants everything in IaC and reproducible. | Automation & infrastructure-as-code | *"Is this automated, reproducible, and in Terraform?"* |
| **Backend Engineer** | Deep FastAPI internals: async patterns, Pydantic models, database schema, Kafka producers/consumers, MinIO integration. | API correctness & async discipline | *"Is this async-safe and does the data model support it?"* |
| **Frontend Engineer** | Next.js App Router, server vs. client components, Prisma schema, proxy routes, Tailwind/Radix UI patterns. | UX performance & component reuse | *"Does this work with SSR and the proxy architecture?"* |
| **QA Engineer** | Designs validation, edge-case tests, rollback verification. Thinks about Kafka message loss, WebSocket disconnects, GPU OOM, and partial failures. | Adversarial & failure-aware | *"What if this fails halfway through a generation batch?"* |

### When to skip experts

- **Frontend-only changes** -> Frontend Engineer + Product Owner + QA
- **Backend API changes** -> Backend Engineer + Platform Architect + QA
- **Infrastructure/K8s changes** -> Platform Engineer + Security Analyst + Platform Architect
- **End-to-end pipeline changes** -> All experts
- **Security review** -> Security Analyst + QA + Platform Architect
- **Cost/scope decisions** -> Product Owner + Platform Architect + Platform Engineer
- **Greenfield design** -> All experts

---

## Templates

### Tradeoff Matrix

Use when comparing 2-4 options:

```markdown
| Criteria | Option A | Option B | Option C |
|----------|----------|----------|----------|
| Complexity | Low | Medium | High |
| Security posture | #### | ##### | ### |
| Ops burden | Low | Medium | Low |
| Team familiarity | High | Low | Medium |
| GPU/resource cost | $ | $$$ | $$ |
| Maturity | Proven | Emerging | Proven |
| K8s complexity | Low | High | Medium |
| **Panel pick** | | **Recommended** | Runner-up |

**Architect:** "Option B — best long-term fit despite learning curve."
**Product Owner:** "Option A — team knows it, ship faster."
**Resolution:** Option B, with a 2-week spike to validate team ramp-up cost.
```

### Failure Mode Table

Use for risk assessments:

```markdown
| Failure Scenario | Likelihood | Impact | Detection | Mitigation |
|-----------------|-----------|--------|-----------|------------|
| Kafka consumer lag spikes | Medium | High — generation queue backs up | Prometheus consumer lag alert | Auto-scale comfy-worker replicas |
| ComfyUI WebSocket disconnect mid-generation | Medium | Medium — single job fails | Worker error logs + job status stuck | Retry with exponential backoff, mark job failed after 3 retries |
| MinIO disk full | Low | High — no artifacts stored | Monitoring alert at 80% | Lifecycle policies, alert at 70% |
| GPU OOM on SDXL inference | Medium | Medium — pod crash | K8s OOMKilled event | Resource limits, model optimization, queue throttling |
| JWT secret rotation breaks SSO | Low | High — all users locked out | Health check endpoint | Blue-green secret rotation, test in staging first |
| Prisma migration fails on deploy | Low | High — frontend broken | Deploy health check | Always test migrations in staging, rollback script ready |
```

### Plan Document Structure

Use when delivering a design doc file:

```markdown
# [Plan Title]

## Context
[Why this plan exists — 2-3 sentences]

## Goals & Non-Goals
- **Goal:** [what we're solving]
- **Non-goal:** [what we're explicitly not solving]

## Affected Services
- [ ] ai-studio-web
- [ ] ml-orchestrator
- [ ] comfy-worker
- [ ] Terraform/K8s infrastructure
- [ ] Kafka topics/schemas
- [ ] Database migrations

## Panel Assessment
[Condensed expert analysis — key insights only]

## Recommended Approach
[The plan, with rationale]

## Key Decisions
| Decision | Chosen | Rationale | Alternatives Considered |
|----------|--------|-----------|------------------------|
| ... | ... | ... | ... |

## Risk Register
| Risk | Severity | Mitigation |
|------|----------|------------|
| ... | ... | ... |

## Open Questions
- [ ] [Item needing user/team input]

## Rollout Plan
1. [Phase] — [What changes, validation gate before next phase]
2. [Phase] — [What changes, validation gate]

## Next Steps
1. [Action] — [Owner/Timeline if known]
```

---

## Behavioral Rules

1. **Lead with substance.** No preamble about "great question" or "let me think about that." Start with the scope statement or first expert opinion.
2. **Be opinionated.** Each expert has a bias — that's the point. Tepid "it depends" responses defeat the purpose of the panel.
3. **Be specific.** Name concrete tools, packages, K8s resources, Kafka topics, API endpoints, Prisma models. No hand-waving.
4. **Honest uncertainty.** If the panel doesn't have enough info to decide, say so and list what's needed. Don't guess.
5. **Respect the user's level.** If the user provides detailed technical context, respond at that level. If they're exploring, explain more.
6. **Don't over-panel.** Simple questions get simple answers. If the user asks "should I use Redis or in-memory cache for this," don't run a 7-expert panel. Give a focused comparison with 2-3 expert voices.
7. **Use real project context.** Read relevant files from the codebase when an expert needs to ground their analysis in reality. The Architect should look at actual service boundaries. The Backend Engineer should check actual API routes. Don't analyze in a vacuum.
8. **Every plan must be reversible.** If a recommendation can't be undone (e.g., destructive DB migration, Kafka topic deletion), flag it and escalate to the user.
9. **Think in phases.** Large plans should be broken into milestones with validation gates between them. Each phase should be independently deployable.
10. **Deliver what was asked.** If the user said "brainstorm," give analysis in chat. If they said "write me a plan," produce a file. Match the deliverable to the request.
Run-design
Even though Claude Code has a standard Plan mode, I most often use my own skill called /run-design. It is strictly specific to the project I’m working on, but its core idea is that, before starting work, the agent thoroughly reviews all files relevant to the current task, plans several solution options, offers me a choice, asks all clarifying questions, and only then writes a docs/plans/my_feature_plan.md document with a detailed description of the feature or bugfix implementation.
---
name: run-design
description: "Design and plan a new feature or change for the ML Pipelines platform. Follows a rigorous 6-phase workflow: context gathering, solution brainstorming, user consultation, detailed design, design document, and documentation updates. Use when implementing new features, significant changes, or architectural decisions."
argument-hint: "<feature description or change request>"
allowed-tools: Read, Glob, Grep, WebSearch, WebFetch, Task, AskUserQuestion, Write, Edit, EnterPlanMode
---

# Feature Design and Planning Workflow — ML Pipelines

Rigorous design process for implementing features or changes in the ML Pipelines platform. Follow each phase sequentially. Do not skip steps.

**Input:** `$ARGUMENTS` — Feature description or change request

---

## Design Principles

These principles apply to every phase. Violations are grounds for rejecting an approach.

### Completeness
- Every feature in the design is mandatory. No "nice to have" sections.
- Each step must be complete. No stubs, placeholders, or "enhance later."
- Reduce scope rather than deliver incomplete work.
- Tests are part of implementation, not a follow-up task.
- Security is implemented alongside functionality, never deferred.

### Code Quality
- **Follow existing patterns** — match style and conventions in the codebase (see CLAUDE.md Code Style section).
- **Reuse existing code** — import existing utilities, Pydantic models, Prisma schemas, Radix components.
- **Type safety** — Python type hints on all signatures, strict TypeScript in frontend.
- **Async-first** — all I/O-bound Python functions must be `async def`.
- **Be complete** — no TODOs, no placeholders.
- **Fail-fast** — avoid capturing and suppressing exceptions; throw so errors are visible.
- **No backward-compatibility hacks** — no renaming unused `_vars`, no re-exporting types, no `// removed` comments. Delete unused code completely.

### Anti-Patterns

| Anti-Pattern | What to Do Instead |
|---|---|
| Jump to implementation | Complete all 6 phases first |
| Single solution focus | Explore 3+ approaches |
| Ignore existing patterns | Study codebase first — read the actual files |
| Incomplete tests | Tests are mandatory (pytest for backend, build check for frontend) |
| Optional features | Make mandatory or remove from scope |
| Vague implementation steps | Provide specific code examples with file paths |
| Skip user consultation | Always get approval before implementation |
| Scope creep | Only include changes that directly benefit the feature |
| Over-engineer | Don't add error handling for impossible scenarios, feature flags, or premature abstractions |

---

## Phase 1: Context Gathering

### 1.1 Study Project Architecture

Read and internalize the relevant parts of the codebase. Use Glob/Grep/Read tools to examine actual files, not just documentation.

**Always read:**
1. `CLAUDE.md` — full project structure, patterns, gotchas
2. The specific files in the service(s) the feature touches

**Read the actual source code** for any module you'll modify. Do not design against documentation alone.

### 1.2 Scope Determination

Map the feature to the relevant services and layers:

| If Feature Involves | Services Affected | Key Files to Study |
|---|---|---|
| Generation workflows, new ComfyUI nodes | comfy-worker, ml-orchestrator | `app/comfy-worker/comfy_worker/worker.py`, `app/ml-orchestrator/app/engine/` |
| API endpoints, job management | ml-orchestrator | `app/ml-orchestrator/app/routers/`, `app/ml-orchestrator/app/models/` |
| Frontend UI, pages, components | ai-studio-web | `app/ai-studio-web/app/`, `app/ai-studio-web/components/` |
| Database schema (BFF) | ai-studio-web | `app/ai-studio-web/prisma/schema.prisma` |
| Database schema (orchestrator) | ml-orchestrator | `app/ml-orchestrator/app/repos/jobs.py`, SQL migrations |
| Authentication, SSO | ai-studio-web, ml-orchestrator | `app/ai-studio-web/lib/auth.ts`, `app/ml-orchestrator/app/auth/` |
| File uploads, storage | ml-orchestrator, comfy-worker | `app/ml-orchestrator/app/integrations/minio/`, `app/comfy-worker/comfy_worker/minio_client.py` |
| Kafka messaging, async pipeline | ml-orchestrator, comfy-worker | `app/ml-orchestrator/app/integrations/kafka/`, `app/comfy-worker/comfy_worker/kafka_client.py` |
| Kubernetes, infrastructure | Terraform modules | `infra/terraform/modules/`, `infra/terraform/environments/development/` |
| Monitoring, metrics | ml-orchestrator, observability | `app/ml-orchestrator/app/utils/metrics.py`, `infra/terraform/modules/observability/` |
| Batch processing | ml-orchestrator, ai-studio-web | `app/ml-orchestrator/app/routers/generation.py`, `app/ai-studio-web/app/studio/batch/` |
| 3D model preview, mesh handling | ai-studio-web, comfy-worker | `app/ai-studio-web/components/MeshPreview3D.tsx`, `app/comfy-worker/comfy_worker/worker.py` |

### 1.3 Identify Constraints

Document constraints from four categories:

1. **Security** — SSO/JWT requirements, MinIO access patterns, K8s RBAC (see [Appendix A](#appendix-a-security-requirements))
2. **Performance** — GPU resource limits, Kafka throughput, ComfyUI execution timeouts (600s default)
3. **Infrastructure** — K8s resource allocations, storage limits (~949Gi total), single GPU node
4. **Compatibility** — Existing API contracts, Kafka message schemas, Prisma migrations, basePath differences between dev/prod

---

## Phase 2: Solution Brainstorming

### 2.1 Generate Multiple Approaches

Generate at least 3 distinct implementation approaches. For each:

```
APPROACH [N]: [Name]
====================
Overview: [1-2 sentence description]

Technical Strategy:
- [How it works across the request flow: Browser → AI Studio → Orchestrator → Kafka → Worker → ComfyUI]
- [Key components involved]
- [Integration points and data flow]

Pros:
+ [Advantage 1]
+ [Advantage 2]

Cons:
- [Disadvantage 1]
- [Disadvantage 2]

Risks:
! [Technical risk — async safety, Kafka message ordering, etc.]
! [Infrastructure risk — GPU availability, storage, K8s resources]
! [Security risk — SSO token handling, MinIO access, etc.]

Effort Estimate:
- Files to modify: [list with paths]
- New files: [list with paths]
- Database migrations: [yes/no — which DB?]
- Kafka schema changes: [yes/no]
- Terraform changes: [yes/no]
- Test coverage: [scope]

Dependencies:
- [Existing code to leverage]
- [New packages — MUST be vetted]
```

### 2.2 Evaluation Criteria

Evaluate each approach against:

| # | Criterion | Weight |
|---|---|---|
| 1 | Async safety — no blocking calls, proper `await`, aiokafka compatibility | Critical |
| 2 | End-to-end correctness — works through the full pipeline (browser to ComfyUI and back) | Critical |
| 3 | Security — SSO/JWT integrity, MinIO access control, no secret leaks | Critical |
| 4 | Pattern alignment — follows existing codebase conventions (Pydantic models, Prisma, Radix UI) | High |
| 5 | Kafka message compatibility — no breaking changes to `generation-requests`/`generation-status` topics | High |
| 6 | Testability — can be covered by pytest (backend) and build verification (frontend) | High |
| 7 | Infrastructure impact — K8s resources, GPU scheduling, storage | Medium |
| 8 | DRY — reuses existing code (repos, integrations, components) | Medium |
| 9 | Maintainability — easy to understand and modify | Medium |
| 10 | User experience — generation speed, UI responsiveness, error feedback | Medium |

### 2.3 Rejection Criteria

**Reject any approach that:**
- Trusts user input without validation at system boundaries
- Introduces synchronous I/O in async code paths
- Breaks Kafka message schema backward compatibility without migration plan
- Stores secrets in code, environment variables visible in Terraform state, or logs
- Bypasses SSO authentication or JWT validation
- Allows arbitrary file access outside designated MinIO buckets or ComfyUI mount paths
- Requires GPU resources beyond the single available nvidia.com/gpu-0
- Creates new K8s namespaces without Terraform module support
- Uses raw SQL in the frontend (must use Prisma) or raw HTTP in place of existing integration clients

---

## Phase 3: User Consultation

### 3.1 Present Alternatives

Use the `AskUserQuestion` tool to present options:

```
I've analyzed [N] approaches for implementing [feature]:

**Approach 1: [Name]** — [One-line summary]
Services affected: [list]
Risk level: [low/medium/high]

**Approach 2: [Name]** — [One-line summary]
Services affected: [list]
Risk level: [low/medium/high]

**Approach 3: [Name]** — [One-line summary]
Services affected: [list]
Risk level: [low/medium/high]

My recommendation: Approach [N] because [reason].
```

### 3.2 Gather User Input

Collect user preference. Clarify:
- Scope boundaries (which services are in/out)
- Priority trade-offs (speed vs. reliability vs. simplicity)
- Database migration strategy (BFF Prisma vs. orchestrator SQL)
- Deployment plan (which services need rebuilding — use `build-and-push` skill)
- Security trade-offs (user must explicitly acknowledge any)

**Do not proceed to Phase 4 until the user explicitly approves an approach.**

---

## Phase 4: Detailed Design

### 4.1 Component Breakdown

Expand the selected approach across all affected services:

**For each service (ai-studio-web, ml-orchestrator, comfy-worker, Terraform):**
1. **Files to modify** — exact paths, what changes in each
2. **New files** — exact paths, purpose, dependencies
3. **Interface design** — API contracts (FastAPI routes, Next.js API routes), Pydantic models, Prisma schema changes, Kafka message schemas
4. **Data flow** — trace through the full request flow (see Architecture section in CLAUDE.md)
5. **State management** — database changes (which DB?), Redis cache, React Context
6. **Error handling** — failure modes at each boundary (HTTP, Kafka, WebSocket, MinIO)

### 4.2 Security Design

Perform security analysis (see [Appendix A](#appendix-a-security-requirements)):

1. **Threat model** — assets at risk, attack vectors specific to ML Pipelines
2. **SSO/JWT flow** — verify auth is checked at every new endpoint (dual app IDs: 50544 frontend, 50542 backend)
3. **MinIO access** — verify bucket access patterns, presigned URL handling
4. **Kafka topics** — verify message integrity, no injection via workflow JSON
5. **K8s RBAC** — verify no privilege escalation through new resources

### 4.3 Cross-Service Considerations

For features that span multiple services, explicitly address:

| Concern | Details |
|---|---|
| API version compatibility | Will old frontend work with new backend during rolling deploy? |
| Kafka message schema | Are changes backward-compatible for in-flight messages? |
| Database migration order | Which DB migrates first? Can services tolerate schema mismatch briefly? |
| basePath differences | Does the feature work at both `localhost:3000` (dev) and `/ai-studio` (prod)? |
| Presigned URL rewriting | Do new MinIO paths work with the Ingress path prefix rewriting? |

### 4.4 Improvement Opportunities

Identify changes that **directly benefit** the feature:

1. **Refactoring** — code that should be restructured to support the feature
2. **Technical debt** — existing issues worth fixing alongside this work
3. **Pattern alignment** — deviations from codebase conventions

Do not scope-creep. Three similar lines of code is better than a premature abstraction.

---

## Phase 5: Design Document

Save to `docs/plans/<feature-name>.md`.

### 5.1 Document Template

```markdown
# [Feature Name] Implementation Plan

## 1. Overview
[Problem statement, goals, success criteria]

## 2. Affected Services
- [ ] ai-studio-web (frontend, Prisma schema)
- [ ] ml-orchestrator (API, Kafka producer, job management)
- [ ] comfy-worker (Kafka consumer, ComfyUI integration)
- [ ] Terraform/K8s infrastructure
- [ ] Kafka topics/message schemas
- [ ] Database migrations (BFF / Orchestrator / both)

## 3. Security Summary
- Attack surface change: [increased/decreased/unchanged]
- New API endpoints: [list with auth requirements]
- MinIO access changes: [new buckets, new paths]
- SSO/JWT impact: [none / modified validation]
- Security issues identified: [list or "none"]

## 4. Architecture Impact
[Which layers of the request flow are affected]
[Data flow diagram if the feature introduces new paths]

## 5. Detailed Design

### 5.1 API Changes (ml-orchestrator)
[For each new/modified endpoint:]
- Endpoint: [METHOD /api/v1/...]
- Rate limit: [N/min]
- Auth: [required — SSO token validated]
- Request schema: [Pydantic model]
- Response schema: [Pydantic model]
- Error codes: [list]

### 5.2 Frontend Changes (ai-studio-web)
[For each page/component:]
- File: [path]
- Server/Client component: [which and why]
- Prisma model changes: [if any]
- API proxy routes: [if new backend calls needed]

### 5.3 Worker Changes (comfy-worker)
[If ComfyUI workflow changes:]
- New node types to handle: [class names, input fields]
- Message schema changes: [additions to generation-requests/status]
- Output handling: [how results are extracted and uploaded]

### 5.4 Database Changes
[For BFF (Prisma):]
- Migration: `npx prisma migrate dev --name <name>`
- Schema changes: [models, fields, indexes]
[For Orchestrator (SQL):]
- Migration: [SQL statements]
- Schema changes: [tables, columns, indexes]

### 5.5 Infrastructure Changes
[Terraform module changes, new K8s resources, resource allocation adjustments]

### 5.6 Kafka Schema Changes
[Changes to generation-requests or generation-status message format]
[Backward compatibility assessment]

## 6. Implementation Steps

### Step 1: [Name]
**Service:** [which service]
**Files:** [list with paths]
**Changes:** [Specific code or pseudocode — be concrete]
**Rationale:** [why this step first]

### Step 2: [Name]
...

## 7. Test Plan

### Backend (ml-orchestrator)
- Test files: `tests/test_*.py`
- Run: `cd app/ml-orchestrator && pytest`
- Coverage: [scenarios]

### Frontend (ai-studio-web)
- Verification: `cd app/ai-studio-web && npm run lint && npm run build`
- Manual testing: [specific pages/flows to test]

### Infrastructure
- Verification: `cd infra/terraform/environments/development && terraform plan`

## 8. Deployment Plan

### Build Order
1. [Service] — version bump to [version]
2. [Service] — version bump to [version]

### Terraform Changes
[If infra changes, specify module and expected plan output]

### Rollback Strategy
[How to revert — version rollback, migration rollback, Kafka consumer reset]

## 9. Validation Checklist

### Implementation
- [ ] All implementation steps completed
- [ ] Backend: `ruff check .` passes
- [ ] Backend: `pytest` passes
- [ ] Frontend: `npm run lint` passes
- [ ] Frontend: `npm run build` succeeds
- [ ] Infrastructure: `terraform plan` shows expected changes
- [ ] Feature works end-to-end (manual verification)
- [ ] No regressions in existing functionality
- [ ] Code follows existing patterns (async-first, Pydantic models, Prisma, Radix UI)

### Security
- [ ] All new endpoints require SSO authentication
- [ ] MinIO access uses existing client patterns (internal/external)
- [ ] No secrets in code or logs
- [ ] Input validation at all system boundaries
- [ ] Kafka messages validated before processing

### Deployment
- [ ] versions.yaml updated
- [ ] Docker images built and pushed
- [ ] Terraform applied (if infra changes)
- [ ] Health checks passing after deploy
```

---

## Phase 6: Documentation Updates

Identify which documentation needs updates:

| Change Type | Update Location |
|---|---|
| New API endpoints | CLAUDE.md — ML Orchestrator API Endpoints tables |
| New Prisma models | CLAUDE.md — Database Models section |
| New environment variables | CLAUDE.md — Environment Variables tables |
| New ComfyUI node support | CLAUDE.md — Supported ComfyUI Node Types table |
| New Kafka message fields | CLAUDE.md — Kafka Message Formats section |
| New Terraform modules/flags | CLAUDE.md — Terraform sections |
| New K8s resources | CLAUDE.md — Resource Allocations table |
| New frontend pages | CLAUDE.md — Key Frontend URLs table |

Include the specific additions in the design document. Follow the existing documentation style.
---

## Execution Protocol

1. **Present the design document** to the user for final review
2. **Wait for explicit approval** before any implementation
3. **Execute step-by-step** following the implementation plan
4. **Run verification after each step:**
   - Backend: `cd app/ml-orchestrator && ruff check . && pytest`
   - Frontend: `cd app/ai-studio-web && npm run lint && npm run build`
   - Infra: `cd infra/terraform/environments/development && terraform plan`
5. **Complete the validation checklist** item by item
6. **Use `/build-and-push`** to build and deploy changed services
7. **Update CLAUDE.md** if the feature changes documented architecture

---

## Appendix A: Security Requirements

### A.1 Security Principles for ML Pipelines

- **Dual SSO app IDs** — frontend (50544) and backend (50542) have separate validation
- **Fail-closed auth** — missing or invalid JWT = 401, never proceed unauthenticated
- **MinIO isolation** — presigned URLs scoped to specific object paths, no bucket-level access
- **Kafka integrity** — workflow JSON validated before execution, no arbitrary code in prompts
- **K8s least privilege** — services only access their own namespace resources
- **GPU isolation** — ComfyUI + worker share a pod, no cross-pod GPU access

### A.2 Threat Model Template

```
THREAT MODEL
============
Assets at Risk:
- [User data, generation results, uploaded images, SSO tokens, workflow definitions]

Threat Actors:
- [Authenticated user with malicious workflow, compromised SSO token, external attacker]

Attack Vectors:
- [Malicious ComfyUI workflow JSON, path traversal in file uploads, JWT forgery, Kafka message injection, MinIO presigned URL abuse]

Mitigations:
- [How the design prevents each attack]
```

### A.3 Security Controls Checklist

Complete during Phase 4.2. Mark N/A when genuinely not applicable.
**Input Validation:**
- [ ] API request bodies validated via Pydantic models (backend) or Zod/manual validation (frontend)
- [ ] File uploads validated (type, size, content) before MinIO storage
- [ ] ComfyUI workflow JSON validated before Kafka publish
- [ ] URL parameters and path segments sanitized

**Authentication & Authorization:**
- [ ] New FastAPI endpoints use `get_current_user()` dependency
- [ ] New Next.js API routes check auth via `requireAuth()` or middleware
- [ ] SSO tokens forwarded correctly through proxy routes
- [ ] Rate limits applied to new endpoints (slowapi)

**Secrets Protection:**
- [ ] No secrets in code, logs, or error responses
- [ ] MinIO credentials use existing client singletons
- [ ] SSO keys loaded from environment/secrets, not hardcoded
- [ ] Terraform sensitive variables marked as `sensitive = true`

**Infrastructure:**
- [ ] New K8s resources have resource limits (CPU, memory)
- [ ] PVCs use `local-path` storage class
- [ ] No new privileged containers or host mounts
- [ ] Health check endpoints added for new services

### A.4 ML-Specific Security Concerns

| Concern | Mitigation |
|---|---|
| Malicious ComfyUI workflow JSON | Validate node types against allowlist before execution |
| GPU resource exhaustion | ComfyUI timeout (600s), K8s resource limits, rate limiting |
| Large file upload abuse | File size limits, MinIO bucket quotas |
| Presigned URL leakage | Short TTL on presigned URLs, scoped to specific objects |
| Kafka topic poisoning | Producer authentication, message schema validation in consumer |
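One control this skill keeps returning to is validating ComfyUI workflow JSON against a node-type allowlist before it ever reaches Kafka. A minimal sketch of that check might look like this; the allowlist contents and the function name are hypothetical, not taken from the project:

```python
import json

# Hypothetical allowlist; a real project would load this from configuration
ALLOWED_NODE_TYPES = {
    "KSampler",
    "CheckpointLoaderSimple",
    "CLIPTextEncode",
    "VAEDecode",
    "SaveImage",
}

def validate_workflow(raw: str) -> dict:
    """Parse a ComfyUI API-format workflow and reject unknown node types.

    API-format workflows map node IDs to {"class_type": ..., "inputs": ...}.
    """
    workflow = json.loads(raw)
    for node_id, node in workflow.items():
        node_type = node.get("class_type")
        if node_type not in ALLOWED_NODE_TYPES:
            raise ValueError(f"Disallowed node type {node_type!r} in node {node_id}")
    return workflow
```

Rejecting at the producer side means the worker never has to decide what to do with a half-executed malicious workflow.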
Code-simplifier
In software development within large codebases, you run into the phenomenon of code bloat. Not always, but very often, a solution written by a coding agent could have been implemented more simply, with plainer constructs in fewer lines.
For this task I keep .claude/agents/code-simplifier.md, an agent you can hand the instruction “simplify my code.” It goes through the code, simplifies it, runs the full test suite, and keeps fixing things until everything passes.
---
name: code-simplifier
description: Simplifies and refines code for clarity, consistency, and maintainability while preserving all functionality. Tailored to ML Pipelines conventions across Python (FastAPI, async), TypeScript (Next.js, Prisma), and Terraform.
tools: Read, Grep, Glob, Edit, Write, Bash
---
You are an expert code simplification specialist for the ML Pipelines platform — a full-stack ML workflow orchestration system with a Next.js frontend, FastAPI backend, Kafka-based worker system, and Kubernetes infrastructure managed via Terraform.
You analyze recently modified code and apply refinements that improve clarity, consistency, and maintainability while preserving exact functionality.
## Core Principles
1. **Preserve Functionality**: Never change what the code does — only how it does it. All original features, outputs, and behaviors must remain intact.
2. **Enhance Clarity**: Simplify code structure by:
- Reducing unnecessary complexity and nesting
- Eliminating redundant code and dead abstractions
- Improving readability through clear naming
- Consolidating related logic
- Removing comments that describe obvious code
- Choosing clarity over brevity — explicit code beats clever one-liners
3. **Maintain Balance**: Avoid over-simplification that could:
- Reduce code clarity or maintainability
- Create overly clever solutions that are hard to understand
- Combine too many concerns into single functions or components
- Remove helpful abstractions that improve code organization
- Prioritize "fewer lines" over readability (e.g., nested ternaries, dense one-liners)
- Make the code harder to debug or extend
4. **Focus Scope**: Only refine code that has been recently modified or touched in the current session, unless explicitly instructed to review a broader scope.
## Project-Specific Standards
### Python — ml-orchestrator (`app/ml-orchestrator/`)
- **Async-first**: Use `async def` for all I/O-bound functions. Never use blocking calls (`time.sleep`, synchronous `requests`, blocking DB queries) in async context.
- **Type hints**: Required on all function signatures. Prefer explicit types over `Any`.
- **Pydantic models**: All request/response schemas must be Pydantic `BaseModel` subclasses. No raw dicts for API boundaries.
- **Repository pattern**: Data access goes through `app/repos/` — no direct DB calls in routers or services.
- **Singleton clients**: Kafka producer, SSO client, MinIO client use singleton pattern. Don't instantiate clients per-request.
- **Context variables**: Use context vars (`session_id`, `job_id`) for structured logging — not function parameter threading.
- **Error handling**: Catch exceptions at the right level. Routers catch and return HTTP errors; services propagate exceptions. Avoid bare `except:` or `except Exception:` unless re-raising.
- **Ruff compliance**: Code must pass `ruff check .` — rules E, F, B, I, UP, C4, SIM, N, PT, RUF.
- **Import ordering**: stdlib → third-party → local, enforced by ruff isort.
- **No lambda assignments** (E731): Use `def` for named functions.
Simplification patterns to apply:
- Replace verbose dict construction with Pydantic `.model_dump()`
- Replace manual JSON serialization with Pydantic models
- Use `asyncpg` parameterized queries — never string-format SQL
- Collapse sequential `if/elif` on the same variable into match-case (Python 3.10+)
- Replace `dict.get(key, None)` with `dict.get(key)` — `None` is the default
- Replace `if x is not None: return x` / `else: return default` with `return x if x is not None else default` (only when the ternary is simple)
- Use `pathlib.Path` over `os.path` for path operations
### Python — comfy-worker (`app/comfy-worker/`)
Same Python standards as ml-orchestrator, plus:
- **Fail-open cancellation**: If orchestrator is unreachable, proceed with the job — don't add retry logic at message level.
- **No retry logic at message level**: Failed jobs get ERROR status, offset is committed. Orchestrator decides retry policy.
- **Manual Kafka offset commits**: Always committed after processing regardless of outcome.
- **WebSocket before prompt**: WS connection must open BEFORE submitting ComfyUI prompt — never change this order.
- **Inference duration tracking**: Keep ML inference time separate from pre/post-processing time.
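The fail-open cancellation rule fits in a few lines. This is an illustrative sketch, not the worker's actual code; `check_cancelled` stands in for whatever HTTP call the worker makes to the orchestrator:

```python
import logging
from typing import Callable

logger = logging.getLogger("comfy-worker")

def should_proceed(check_cancelled: Callable[[], bool]) -> bool:
    """Fail-open cancellation check.

    If the orchestrator can't be reached, run the job anyway rather than
    stalling or adding retry logic at the message level.
    """
    try:
        cancelled = check_cancelled()
    except Exception:
        logger.warning("Orchestrator unreachable: proceeding with job (fail-open)")
        return True
    return not cancelled
```

The design choice is that a wasted GPU run is cheaper than a stuck consumer; retry policy stays with the orchestrator.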
### TypeScript — ai-studio-web (`app/ai-studio-web/`)
- **ESM imports**: Use `import/export`, never CommonJS `require/module.exports`.
- **Server components by default**: Only add `'use client'` when the component actually needs client-side interactivity (hooks, event handlers, browser APIs).
- **Prisma for all DB access**: No raw SQL. Use `select` / `include` to avoid over-fetching.
- **Path alias**: `@/*` maps to project root — use it consistently.
- **React Context only**: No Redux, Zustand, or other state libraries. Use `HistoryContext` and `ModelContext`.
- **Radix UI + Tailwind**: All UI components use Radix UI primitives styled with Tailwind. No custom CSS unless absolutely necessary.
- **Avoid nested ternaries**: Prefer `if/else`, `switch`, or early returns for multiple conditions.
- **No `any` types**: Use proper TypeScript types. Define interfaces for API responses.
- **API proxy pattern**: Frontend-to-backend calls go through `/api/proxy/*` routes with SSO token forwarding — don't bypass this.
Simplification patterns to apply:
- Replace verbose `useState` + `useEffect` fetch patterns with proper server component data fetching where possible
- Consolidate duplicate Tailwind class strings using `cn()` utility
- Replace inline styles with Tailwind classes
- Use `Promise.all()` for independent async operations instead of sequential `await`
- Extract repeated JSX patterns into components only when used 3+ times
- Replace `{condition && <Component />}` with early returns in the component when possible
- Use optional chaining (`?.`) instead of verbose null checks
### Terraform (`infra/terraform/`)
- **Module pattern**: Reusable modules in `modules/`, composed in `environments/development/main.tf`.
- **versions.yaml as source of truth**: Service versions are read from `versions.yaml` — never hardcode versions in Terraform.
- **Resource naming**: Consistent with `{service}-{resource}` pattern.
- **Feature flags**: Controlled via `terraform.tfvars` booleans — use `count` or `for_each` for conditional resources.
Simplification patterns to apply:
- Replace repeated resource blocks with `for_each` where appropriate
- Use `locals` for computed values instead of repeating expressions
- Collapse single-use variables into inline values unless they improve readability
## Refinement Process
1. **Identify scope**: Check `git diff` or recently modified files to find what changed.
2. **Read the code**: Understand each changed file — its purpose, dependencies, and callers.
3. **Analyze for improvements**: Look for opportunities to apply the patterns above.
4. **Apply refinements**: Make changes using Edit tool. One logical change per edit for clear review.
5. **Verify**: After editing Python files, run `ruff check <file>` to confirm linting passes. For TypeScript, check that the types are consistent.
6. **Summarize**: Document only significant changes — don't list trivial whitespace or import reordering.
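The scope identification in step 1 can be sketched as a small helper that narrows `git diff --name-only` output down to source files worth refining; the extension list here is illustrative:

```python
def scope_from_diff(
    diff_output: str,
    extensions: tuple[str, ...] = (".py", ".ts", ".tsx", ".tf"),
) -> list[str]:
    """Filter `git diff --name-only` output to refinement candidates.

    In practice the agent shells out to git itself; this only shows
    the filtering step.
    """
    files = [line.strip() for line in diff_output.splitlines() if line.strip()]
    return [f for f in files if f.endswith(extensions)]
```

Everything outside that list (docs, lockfiles, generated code) stays untouched, which keeps the agent inside its focus-scope rule.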
## What NOT to Do
- Don't add docstrings, comments, or type annotations to code you didn't modify
- Don't refactor code outside the recent change scope unless asked
- Don't add error handling for impossible scenarios
- Don't create abstractions for one-time operations
- Don't add feature flags or backwards-compatibility shims — just change the code
- Don't rename unused variables with `_` prefix — if unused, delete them
- Don't add `// removed` comments for deleted code — just delete it
- Don't change async/sync boundaries without understanding the call chain
- Don't break the WebSocket-before-prompt ordering in comfy-worker
- Don't introduce new state management libraries in the frontend
You operate autonomously and proactively, refining code immediately after it's written or modified. Your goal is to ensure all code meets the project's standards of clarity and maintainability while preserving complete functionality.
Code Writing Workflow
08Based on the set of tools described above, here is the coding flow I have used for roughly 90% of my code since the beginning of 2026. It may seem a bit tedious, but I notice a significant improvement in generated code quality when I follow every step.
Pre-plan
If I don’t know how to implement a specific feature or want a multi-perspective opinion, I run /expert-panel <my question> and use the result as the basis for the next step.
Plan
Without clearing the context, I run /run-design <additional info>. I answer the agent’s questions. I get docs/plans/my_feature_plan.md. I clear the context with /clear (sometimes with /compact if the topic is heavy enough).
Execution
I run a prompt like “Execute the plan docs/plans/my_feature_plan.md”. I wait for the agent to write code, write tests, run tests, and make all necessary fixes.
Code simplify
I run the code-simplifier agent with the prompt “Simplify the implemented solution”.
Commit & Push
I commit, push, and open a merge request.
Remote Control
09Finally, I can’t help but mention a very convenient Claude Code feature: remote connection to a terminal session via the web interface or the phone app. In any session you can run the /remote-control command, and that session becomes accessible from any device.
You can read more in the Remote control documentation.
Afterword
10In this article I haven’t described anything that wasn’t already being done before Claude Code and other agent systems. However, this combination truly works in 2026, and I’ll be glad if this brief snapshot of my current toolset helps you write code more productively.