Monday, March 2, 2026

Building a NAS With AI: What Claude Got Right, Wrong, and Hilariously Confused About

Custom 3D-printed NAS enclosure with ZimaBlade and dual hard drives

Last month, I built a home NAS infrastructure from scratch: two TrueNAS servers, ZFS mirrored pools, automated replication, encrypted cloud backups, 28 monitoring checks, and a failover system that can switch servers in five minutes. The whole thing — design, documentation, configuration, migration — was done in collaboration with an AI coding assistant.

It was genuinely, surprisingly useful. It also tried to make my backup server unwritable, which would have broken the entire replication chain. So let's talk about what actually happened.

What I Built (and Why)

I wanted my family's data — photos, documents, project files — on infrastructure I control. Not on Google Drive, not on iCloud, not on any service that can change its terms, raise its prices, or hand my data to a government I didn't elect. That ruled out cloud-only storage. It also made me skeptical of vendor-locked NAS solutions like Synology, where you're one firmware update away from features disappearing or subscriptions appearing.

So I built it myself, using two small single-board servers (a ZimaBlade 7700 and a ZimaBoard 832) running TrueNAS — open-source, ZFS-based, no license fees. I already had one of the boards and one hard drive from a previous setup. The new hardware — a ZimaBlade, three 4TB drives — came to around €400. That's less than a single Synology DS224+ with equivalent drives, and I got two fully redundant NAS units out of it. A single Synology gives you one copy of your data in one location — you'd need a second unit plus a cloud subscription to achieve 3-2-1. I got both local copies covered for less than most people spend on their first NAS.

Both boards are tiny — credit-card-sized — so off-the-shelf cases weren't an option. I designed custom enclosures in FreeCAD that house a board and two 3.5" drives each, and printed them on my 3D printer. It's not the prettiest rack in the world, but it fits neatly in a utility closet and keeps everything ventilated.

The project follows the 3-2-1 backup rule: three copies of data, two different media, one off-site. Alpha is the primary NAS serving files via SMB and NFS to a Kubernetes cluster. Beta receives ZFS replication every six hours. Beta then uploads encrypted backups to a European cloud provider daily. Three copies, two independent systems, one off-site — with the off-site copy encrypted before it leaves my network.
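Under the hood, one Alpha-to-Beta replication pass amounts to an incremental ZFS send/receive. A minimal sketch that prints the commands such a pass would run, assuming hypothetical pool names (tank/data on Alpha, backup/data on Beta) and snapshot names; in practice TrueNAS manages this through its own replication tasks:

```shell
# Print the commands one incremental replication pass would run.
# Pool names ("tank", "backup"), host "beta", and snapshot names
# are illustrative, not the actual values from this setup.
prev="tank/data@auto-2026-03-01_18-00"   # last snapshot Beta already has
next="tank/data@auto-2026-03-02_00-00"   # new snapshot to replicate
plan="zfs snapshot -r ${next}
zfs send -R -i ${prev} ${next} | ssh beta zfs recv -F backup/data"
echo "$plan"
```

The -i flag sends only the blocks changed since the previous snapshot, which is why a six-hour cadence stays cheap even on small hardware.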

There are 16 Architecture Decision Records documenting choices like "why are the pools unencrypted?" and "why does cloud backup run from Beta instead of Alpha?" There are 24 Standard Operating Procedures covering everything from drive replacement to disaster recovery. An 11-phase migration plan moved ~565 GB from the old NAS to the new setup.

All of that was produced in conversation with Claude Code, Anthropic's AI coding assistant. Not by typing prompts into a chat window — by working in a terminal where the AI could read files, run commands, SSH into the NAS boxes, and execute ZFS operations directly.

What AI Got Right

Documentation was the killer feature. I'm an engineer who knows what he wants but doesn't love writing 24 SOPs. Each procedure has YAML frontmatter (category, trigger, risk level, approval requirements), numbered steps, verification checks, and rollback instructions. Claude generated these from our conversations about how each operation should work. I described the intent, reviewed the output, and corrected the details. What would have taken me a weekend of reluctant writing happened naturally as a byproduct of the design process.
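The frontmatter on each SOP looks roughly like this (the field names are the ones described above; the values are illustrative):

```yaml
---
category: storage
trigger: "Drive shows CRC errors or SMART failure"
risk_level: high
approval: required
---
```

Structured frontmatter like this is what lets a later consistency pass check every procedure against the same schema.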

The Architecture Decision Records were similar. I'd explain a trade-off — "should I encrypt the ZFS pools or just encrypt the cloud backup?" — and get back a structured ADR with considered options, pros/cons, and a clear decision rationale. Sixteen of these, each capturing a decision I'd otherwise have kept in my head and forgotten the reasoning for six months later.

Architecture review caught real gaps. During one session, Claude pointed out that my monitoring system (Uptime Kuma, running on the Kubernetes cluster) depends on NFS storage from the very NAS it's monitoring. If Alpha dies, Kubernetes loses storage, Uptime Kuma goes down, and nobody gets alerted. I knew this intellectually, but having it surfaced during design — not during a 2 AM outage — meant I could add TrueNAS native email alerts as a fallback layer before it mattered.

Another catch: Beta wasn't actually ready for failover. The services and snapshot tasks hadn't been pre-configured. It would have taken 30-60 minutes of manual configuration during an actual failure. Claude flagged this during verification, we pre-configured everything in disabled state, and failover time dropped to five minutes.

Migration execution was where things got wild. Claude had SSH access to both NAS units via MCP (Model Context Protocol) servers. It could run zpool status, create datasets, transfer data, configure NFS exports, and verify replication — all while tracking progress in a migration document. Eleven phases — hardware build over a week, then software configuration through migration in a single intense day. The AI handled the tedious parts (creating 27 ZFS datasets with consistent properties, generating 60+ NFS subdirectory exports) while I made the judgment calls (when to cut over, whether the NFS mount errors were acceptable).
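Bulk dataset creation is exactly the kind of step worth scripting rather than clicking through. A sketch that generates the zfs create commands for review before anything runs (the pool name "tank", the dataset names, and the property set are assumptions, not the actual values from the migration):

```shell
# Generate `zfs create` commands for a batch of datasets so they can
# be reviewed before execution. All names and properties here are
# hypothetical stand-ins for the real 27-dataset layout.
cmds=""
for ds in photos documents projects media; do
  cmds="${cmds}zfs create -o compression=lz4 -o atime=off tank/data/${ds}
"
done
printf "%s" "$cmds"
```

Generating the commands first, then executing the reviewed list, is what kept 27 datasets consistent without a single typo.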

What AI Got Wrong

Here's where it gets interesting.

The readonly incident. During failover procedure design, Claude suggested running zfs set readonly=on on Beta's datasets to "protect" the replication targets from accidental writes. Sounds reasonable, right? Except ZFS readonly blocks all writes — including zfs recv, the command that receives replication data. If I'd applied it, Beta would have silently stopped accepting replicas while looking perfectly healthy. My backup copy would have grown staler by the day, with no alerts.

This is the kind of mistake that's terrifying precisely because it's plausible. An engineer who doesn't know ZFS internals might accept that suggestion. The fix was simple (Beta's write protection comes from not having active SMB/NFS shares, not from a ZFS property), but finding this in production instead of in review would have been ugly.
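The failure mode here is silent staleness, which is also what makes it detectable: a check that compares the newest received snapshot's age against the replication interval would have caught it. A sketch of that logic, with a fixed epoch timestamp standing in for real zfs list output so it runs standalone:

```shell
# Snapshot-staleness check sketch. On a real system `latest` would be:
#   zfs list -Hp -t snapshot -o creation -S creation backup/data | head -1
# (-p prints the creation time as epoch seconds). The values below are
# fixed stand-ins so the logic can be shown on its own.
latest=1700000000          # creation time of newest received snapshot
now=1700030000             # normally $(date +%s)
interval=$(( 6 * 3600 ))   # replication runs every six hours
grace=1800                 # allow 30 minutes of slack
age=$(( now - latest ))
if [ "$age" -gt $(( interval + grace )) ]; then
  status="STALE: newest snapshot is ${age}s old"
else
  status="OK"
fi
echo "$status"
```

With these example values the snapshot is over eight hours old, so the check flags it — exactly the alert that readonly=on would otherwise have suppressed forever.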

The Time Machine saga. Claude helped set up a Time Machine dataset on the NAS for Mac backups, complete with snapshot tasks. Then I realized a local USB drive was simpler for my one-laptop setup. Removing it took three fix commits: removing the snapshot tasks, destroying the dataset, and then chasing down Time Machine references scattered across multiple SOPs and documentation files. The AI was solving the general problem ("Mac users need backups") instead of my specific problem ("one Mac laptop that's already backed up to iCloud").

Ghost references from training data. Nextcloud kept appearing in suggestions and documentation, even though I never planned to run it. Claude's training data is full of home lab setups that use Nextcloud, so it kept assuming I would too. Similarly, I'd occasionally get suggestions referencing FreeBSD behavior — reasonable for older TrueNAS versions, but wrong for TrueNAS 25.10, which is Debian-based. The NFS subdirectory export issue during migration (75 minutes of unexpected downtime) was partly because the AI initially suggested approaches that work on FreeBSD but not on Linux.

Over-engineering suggestions. An AI assistant has no sense of "this is a home lab with two users." It treats your project with the same architectural rigor it would apply to a production system serving thousands. I had to repeatedly push back on suggestions for automated failover (I have two NAS boxes in the same house — I can walk to them), complex monitoring dashboards (Uptime Kuma is fine), and elaborate access control (it's my family's files).

The Guardrails That Saved Me

After the readonly incident, I got serious about constraints.

A safety rules file lives in the repository: never run zfs destroy without approval, never modify Beta's datasets directly, never change the primary NAS IP without coordinating with the Kubernetes cluster. Claude reads this file at the start of every conversation and respects it. It's the equivalent of putting "DANGER: HIGH VOLTAGE" signs on the electrical panel — obvious to humans, genuinely useful for AI.
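The file itself is short. A sketch of what it looks like — the wording is illustrative, but the rules are the ones just described:

```
# SAFETY RULES — read before any operation

- NEVER run zfs destroy or zfs rollback without explicit approval.
- NEVER modify Beta's replicated datasets directly.
- NEVER change Alpha's IP without coordinating the Kubernetes NFS mounts.
```

The point isn't sophistication — it's that the rules live in the repository, so every session starts from the same constraints.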

ADRs as guardrails, not just documentation. The instruction "before proposing changes, check whether an ADR covers that area" turns 16 decision records into active constraints. When Claude suggests encrypting the ZFS pools, ADR-0001 stops it. When it suggests running cloud backup from Alpha, ADR-0002 stops it. The decisions compound — each one narrows the space of bad suggestions.

A memory system records mistakes and lessons across conversations. The readonly bug is in there. The Time Machine saga is in there. "CRC errors on Beta sdb suggest a possible SATA cable issue" is in there. This means the AI doesn't repeat known mistakes and carries forward context that would otherwise be lost between sessions.

Explicit approval gates for anything destructive. The AI can read pool status and list snapshots all day long, but it cannot destroy a dataset or roll back a snapshot without me typing "yes." This isn't a technical limitation — it's a convention enforced by the safety rules and, honestly, by me paying attention.

Would I Do It Again?

Without hesitation. But with different expectations than when I started.

AI is excellent at:

  • Documentation — generating structured, consistent docs from conversational design sessions
  • Consistency checking — finding mismatches between SOPs, ADRs, and README sections
  • Tedious execution — creating 27 datasets, 60 NFS exports, 28 monitoring checks without typos
  • Gap detection — "you have a failover procedure but no failback procedure" is exactly the kind of thing humans miss

AI is unreliable at:

  • Platform-specific behavior — ZFS on FreeBSD vs. Linux, TrueNAS GUI limitations, hardware quirks
  • Knowing when to stop — it will happily over-engineer a home lab into a production data center
  • Physical context — it doesn't know your NAS boxes are in the same house, your only Mac is already backed up, or that you have two users not two thousand

The key insight I keep coming back to: AI doesn't replace knowing your system — it amplifies what you already know. I understood ZFS, the 3-2-1 backup strategy, and what my family actually needs from a NAS. Claude helped me document that understanding, catch my blind spots, and execute the boring parts at speed. When it suggested something wrong, I caught it because I understood the domain.

If I'd been a complete beginner trusting the AI to design my backup strategy, the readonly bug would have made it to production. The Time Machine detour would have stayed. The FreeBSD assumptions would have caused more than 75 minutes of downtime.

The best use of AI in infrastructure isn't "build this for me." It's "I know what I want — help me document it, verify it, and catch what I missed." That framing kept the project on track through 82 commits, 16 decisions, 24 procedures, and one very educational mistake about ZFS write semantics.

