RClone > AWS DataSync


The Problem

I run a cloud-native distributed filesystem that multiple teams rely on daily. We're talking ~50 terabytes of active production data. The data lives in a vendor's cloud - not mine. If something breaks, I need my own copy sitting in object storage. Simple enough requirement, right?

The First Attempt: Reaching for the "Managed" Option

My first instinct was the managed service route. You know the one - purpose-built for moving data between storage systems, backed by a cloud provider, theoretically "just works."

I wrote the automation. Activation keys, IAM role plumbing, service principal policies, SSM parameter handshakes. The whole ceremony.

But it had friction:

  • Cross-account complexity: The agent needed broad permissions that made security reviews uncomfortable.
  • Operational overhead: Essentially a VM-in-a-VM model designed for hybrid/on-prem migrations. I was running it on the same instance that hosted my filesystem client. Sledgehammer, picture frame.
  • Limited observability: Getting transfer metrics required polling a separate API. I wanted something closer to the metal.

The role sat commented out in my playbook for weeks. I knew what that meant. I just didn't want to admit the sunk cost yet.

The Replacement: Just a Copy Command on a Timer

Eventually I ripped the bandaid off. In a single commit - roughly 400 lines added, 400 removed - I:

  1. Deleted the entire managed service automation (300+ lines, gone, bye DataSync).
  2. Created a new role with: a systemd timer firing daily at 2 AM with randomized delay, a oneshot service running a sync command, config templates using IAM instance role auth, and an exclude list for filesystem metadata.
  3. Tightened the bucket policy - replaced a broad root principal with a specific IAM role, split bucket-level vs. object-level permissions (see the sketch below).
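
For flavor, the tightened policy splits roughly like this - the account ID, role, and bucket names are placeholders, not the real ones:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BucketLevel",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111111111111:role/backup-writer" },
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::example-backup-bucket"
    },
    {
      "Sid": "ObjectLevel",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111111111111:role/backup-writer" },
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::example-backup-bucket/*"
    }
  ]
}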

That was it. An RClone copy command, on a timer, with some config. The kind of thing that feels too simple to be the answer. Spoiler: it was the answer. But getting it production-ready? That took months of annoying little discoveries.
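
Stripped of the Ansible templating, it comes down to roughly three small pieces - the remote name, bucket, paths, and exclude patterns below are illustrative:

# /etc/rclone/rclone.conf - S3 remote that authenticates via the instance role
[s3-backup]
type = s3
provider = AWS
env_auth = true
region = us-east-1

# /etc/rclone/excludes.txt - filesystem metadata we never want in the backup
.snapshot/**
lost+found/**

# The nightly transfer itself
rclone copy /mnt/prodfs s3-backup:example-backup-bucket \
  --exclude-from /etc/rclone/excludes.txt \
  --transfers 8 --checkers 16 \
  --log-file /var/log/rclone-backup.log --log-level INFO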

The Hardening

Death by a Thousand Papercuts

What followed was months of iterative refinement. Every time I thought "okay, this is done," something new would surface. A reboot at the wrong time. A permission denied on a file that existed yesterday. An alarm that wouldn't clear. Each one small enough to feel trivial, but collectively they ate weeks.

Permissions

Some files on the mount are locked down. I spent an embarrassing amount of time debugging transfer failures before realizing certain files were root:root with no group read. I eventually had to run the service as root with a dedicated access group - a pragmatic compromise I documented directly in the service file. You write the comment so future-you doesn't undo it in a fit of "why is this running as root?"
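
In the unit file, that compromise reads roughly like this (unit and group names are illustrative):

# rclone-backup.service (excerpt)
[Service]
Type=oneshot
# Deliberate: some source files are root:root with no group read, so the
# transfer runs as root with a dedicated access group rather than a service
# user. Don't "fix" this back to an unprivileged user without re-checking.
User=root
Group=rclone-backup
ExecStart=/usr/local/bin/rclone-backup.sh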

Timer Edge Cases

I discovered that combining OnCalendar=daily with Persistent=true could trigger duplicate runs after a reboot near the scheduled time. This one took longer to diagnose than I'd like to admit - the logs looked normal, but my transfer counts were doubling on days the instance recycled. Fix: drop the redundant directive, rely on the explicit schedule only.
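
Roughly the shape the timer ended up in - a single explicit schedule, with nothing left for Persistent= to collide with:

# rclone-backup.timer (excerpt)
[Timer]
# One explicit schedule only; the generic OnCalendar=daily line that used to
# sit alongside it is what doubled runs around reboots.
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target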

Monitoring (The Iterative Part)

This is the one that really tested my patience. Monitoring evolved in three stages, each driven by a real question that only surfaced after the previous answer wasn't enough:

Stage 1 - "Did it run?"
A single heartbeat metric published as a post-transfer hook:

aws cloudwatch put-metric-data --namespace Monitoring --metric-name LastSuccess --value 1 --dimensions InstanceId=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) 

An alarm fires if this metric goes missing for 25 hours (24h schedule + random delay + drift). Great, I thought. Done.

Stage 2 - "Why is the alarm still firing?"
It wasn't. But it also wasn't clearing. I had a single 25-hour evaluation period, which meant even after a successful run, the alarm sat in ALARM state for up to a full day before resolving. Switched to 25 x 1-hour periods. Now it clears within an hour. The kind of fix that takes 30 seconds to implement and two days to figure out you need.
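
In CLI terms the fix is just the period math - something like this, with a placeholder alarm name and instance ID:

aws cloudwatch put-metric-alarm \
  --alarm-name backup-heartbeat \
  --namespace Monitoring --metric-name LastSuccess \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Sum --period 3600 \
  --evaluation-periods 25 --datapoints-to-alarm 25 \
  --threshold 1 --comparison-operator LessThanThreshold \
  --treat-missing-data breaching

Missing data counts as breaching, so 25 empty hours raises the alarm; the first hourly period that sees a LastSuccess datapoint lets it clear.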

Stage 3 - "What actually happened?"
A small shell script parses the tool's stats output and publishes six metrics in one API call:

  • LastSuccess - heartbeat for the alarm
  • FilesTransferred - count per run
  • BytesTransferred - volume per run
  • Checks - files compared but not transferred
  • TransferErrors - partial failures (locked files, encoding issues)
  • ElapsedSeconds - wall-clock duration
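
The hook itself is a small wrapper - sketched here with illustrative paths, and with awk patterns aimed at the end-of-run stats summary rclone prints, which may need tweaking between rclone versions:

#!/usr/bin/env bash
set -euo pipefail

LOG=/var/log/rclone-backup.log
IID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
DIM="Dimensions=[{Name=InstanceId,Value=$IID}]"

: > "$LOG"   # fresh log per run, so the parsing below only sees this run
start=$(date +%s); rc=0
rclone copy /mnt/prodfs s3-backup:example-backup-bucket \
  --exclude-from /etc/rclone/excludes.txt \
  --log-file "$LOG" --log-level INFO || rc=$?
elapsed=$(( $(date +%s) - start ))

# Pull counters out of the summary block at the end of the log.
errors=$(awk 'BEGIN{n=0} /Errors:/      {n=$2+0} END{print n}' "$LOG")
checks=$(awk 'BEGIN{n=0} /Checks:/      {n=$2+0} END{print n}' "$LOG")
files=$(awk  'BEGIN{n=0} /Transferred:/ {n=$2}   END{print n+0}' "$LOG")    # file-count line is printed last
bytes=$(awk  'BEGIN{n=0} /Transferred:/ {n=$2; exit} END{print n}' "$LOG")  # human-readable figure; the real script normalises units

# Six metrics, one API call.
aws cloudwatch put-metric-data --namespace Monitoring --metric-data \
  "MetricName=LastSuccess,Value=$(( rc == 0 ? 1 : 0 )),$DIM" \
  "MetricName=FilesTransferred,Value=$files,$DIM" \
  "MetricName=BytesTransferred,Value=$bytes,$DIM" \
  "MetricName=Checks,Value=$checks,$DIM" \
  "MetricName=TransferErrors,Value=$errors,$DIM" \
  "MetricName=ElapsedSeconds,Value=$elapsed,$DIM"

exit "$rc"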

I should have started here. But you never know which questions you'll need to answer until you're trying to answer them at an inconvenient hour.

From sync to copy

This one stung. I was running sync mode (mirror behavior - deletes files in destination that don't exist in source). Which is fine until someone accidentally removes a directory from the source mount and your "backup" dutifully removes it too. Switched to copy (additive only). I'd rather have stale files in object storage than accidentally delete a backup. Boring? Yes. The thing that lets me sleep? Also yes.
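
The change is one word; the behavior is not:

# Before: mirrors deletions - a directory removed from the source disappears from the bucket too
rclone sync /mnt/prodfs s3-backup:example-backup-bucket --exclude-from /etc/rclone/excludes.txt

# After: additive only - never deletes anything on the destination
rclone copy /mnt/prodfs s3-backup:example-backup-bucket --exclude-from /etc/rclone/excludes.txt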

Supply Chain Hygiene

I stopped downloading packages from the internet at install time. Instead: mirror to private storage, pin the version, verify SHA-256 checksums, fail loudly on mismatch. Adds friction to upgrades but eliminates "the internet changed under us" failures.
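
The install step is short and deliberately paranoid - sketched here with a placeholder mirror bucket, version, and checksum:

set -euo pipefail
version="1.66.0"          # pinned; bumped deliberately, never "latest"
expected="9f86d081..."    # sha256 recorded when the package was mirrored (placeholder)
artifact="rclone-v${version}-linux-amd64.zip"

aws s3 cp "s3://example-internal-mirror/rclone/${artifact}" "/tmp/${artifact}"
actual=$(sha256sum "/tmp/${artifact}" | awk '{print $1}')
[ "$actual" = "$expected" ] || { echo "checksum mismatch for ${artifact} - refusing to install" >&2; exit 1; }
unzip -o "/tmp/${artifact}" -d /tmp
install -m 0755 "/tmp/rclone-v${version}-linux-amd64/rclone" /usr/local/bin/rclone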

Config as Data

All tuning knobs moved into Ansible inventory group vars. Beta gets conservative settings, production gets aggressive ones. No template changes required to switch between them.
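
Variable names here are illustrative, but the shape is one template, two inventories:

# group_vars/beta.yml
rclone_transfers: 4
rclone_checkers: 8
rclone_bwlimit: "50M"

# group_vars/production.yml
rclone_transfers: 16
rclone_checkers: 32
rclone_bwlimit: "off"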

What I Learned

Managed services aren't always the right abstraction. For a daily incremental backup of a mounted filesystem, the managed service was over-engineered. A copy command on a timer gave me a simpler mental model and fewer moving parts - even at this scale.

Systemd is an underrated orchestration layer. Timers with Persistent=true and RandomizedDelaySec, oneshot services with pre/post hooks, nuanced exit code handling, memory caps and CPU quotas - all without a container runtime.

Observability doesn't have to start complex - but it will get there. Mine grew from nothing to a single heartbeat to six granular metrics. Each step answered a specific operational question. Start with "did it run?" and expand when reality demands it.

The hardening phase is the actual project. The initial replacement took a day. Making it production-solid took months. Every "simple" infrastructure project I've worked on follows this pattern, and I keep being surprised by it.

Supply chain security is a practice. Pin versions. Mirror packages. Verify checksums. It's not exciting, but it removes an entire class of surprises.

The best infrastructure decisions look boring in retrospect. A well-configured open-source tool, wrapped in systemd and monitored through a few metrics, isn't winning any architecture awards. But it runs every night, tells me exactly what happened, and when something breaks, I know within the hour. That's the kind of boring I'm proud of.