Every team I’ve joined has its own version of the same problem: deployments that work, but only the person who built them knows why. At NovumDMS that meant an AgriTech/IoT platform, a multi-tenant SaaS ERP layer and a distribution ERP — each with its own scripts, its own conventions, and no shared notion of “what is actually running in production right now.”
The goal wasn’t ArgoCD for its own sake. It was to answer three questions instantly, for any environment: what’s deployed, who changed it, and how do we roll it back? GitOps gives you all three for free, as long as every change to the cluster goes through Git.
Start with the repository layout, not the tool
The most important decision happens before you install anything. We separated application code from deployment manifests — app repos build and push images; a dedicated config repo describes desired cluster state. ArgoCD watches the config repo and reconciles.
# config repo — one source of truth per env
clusters/
  agritech-prod/ apps/ values/
  erp-saas-prod/ apps/ values/
  distrib-prod/ apps/ values/
_base/ # shared kustomize bases
The _base directory is where the real win lives. Products that used to drift apart now share Helm/Kustomize bases, so a change to, say, the standard probe configuration happens once.
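As a rough illustration of how an environment can consume a shared base, here is a minimal Kustomize overlay. The service name `ingest-api`, the registry URL, the tag, and the relative path to `_base` are all hypothetical; the path in particular depends on where `_base` sits relative to the overlay.

```yaml
# kustomization.yaml for one service in one environment (hypothetical example)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Shared base holding the Deployment, Service and standard probes.
  # Assumed location; adjust the relative path to your layout.
  - ../../../_base/ingest-api

# Only environment-specific values live here; probes are inherited from the base.
images:
  - name: ingest-api                          # must match the image name used in the base
    newName: registry.example.com/ingest-api  # placeholder registry
    newTag: "1.4.2"                           # placeholder tag
```

With this shape, editing the probe block in the base changes every environment that references it, and each change still arrives as a reviewable commit.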
App-of-apps, so onboarding is one PR
Rather than registering each application by hand, we used the app-of-apps pattern: a single root Application points at a directory of child Applications. Adding a service to an environment becomes a one-file pull request, reviewed like any other change.
If adding a service to production isn’t a reviewable pull request, you don’t have GitOps — you have a wiki page that lies.
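A root Application of the kind described above might look roughly like this. It is a sketch, not our exact manifest: the Application name, the repository URL, the `default` project, and the `argocd` namespace are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: agritech-prod-root          # hypothetical name
  namespace: argocd                 # assumes Argo CD is installed in "argocd"
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/config-repo.git  # placeholder URL
    targetRevision: main
    path: clusters/agritech-prod/apps   # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd               # child Applications are created here
  syncPolicy:
    automated:
      prune: true                   # deleting a child's file removes that child Application
      selfHeal: true
```

Onboarding a service then means adding one more Application manifest under the `apps/` path in a pull request.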
Migrating without downtime
The trick to a zero-downtime cutover is that ArgoCD can adopt resources it didn’t create. The sequence we ran for each service:
- Export the live manifests and commit them exactly as they are to the config repo.
- Create the ArgoCD Application in manual sync mode and let it report Synced / Healthy against the running workload, with no changes applied.
- Diff obsessively. If ArgoCD wants to change anything, the committed manifests are wrong, not the cluster.
- Only once the diff is empty do you flip to automated sync with self-heal.
Note: Resist the urge to “clean up” manifests during the migration. Adopt first, refactor later — in a separate PR you can actually reason about.
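The adoption steps above can be sketched as a single child Application whose sync policy changes over time. The service name, target namespace, repository URL, and manifest path are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingest-api                  # hypothetical service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/config-repo.git  # placeholder URL
    targetRevision: main
    path: clusters/agritech-prod/apps/ingest-api   # hypothetical path to the exported manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: ingest               # namespace where the workload already runs (assumed)

  # Adoption phase: no "automated" block, so Argo CD only compares
  # the committed manifests with the live objects and reports the result.
  #
  # Once the diff is empty, enable automation in a follow-up PR:
  # syncPolicy:
  #   automated:
  #     selfHeal: true
  #     prune: false                # keep pruning off until you trust the inventory
```

Keeping the switch to automated sync in its own commit makes it easy to revert if self-heal turns out to fight something you did not know about.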
Secrets without leaking them into Git
GitOps and secrets feel contradictory until you encrypt at rest. We used SOPS + age: secrets live in the repo encrypted and are decrypted in-cluster at apply time. Auditable, reviewable, and no plaintext ever touches a commit.
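For the encryption side, a minimal `.sops.yaml` could look like the sketch below. The secrets path and the age recipient are placeholders, and the in-cluster decryption step is not shown because it depends on which SOPS integration you pair with ArgoCD.

```yaml
# .sops.yaml at the root of the config repo (illustrative)
creation_rules:
  # Hypothetical location for secret manifests; adjust to your layout.
  - path_regex: 'clusters/.*/secrets/.*\.yaml$'
    # Encrypt only the payload fields so metadata stays readable in reviews.
    encrypted_regex: '^(data|stringData)$'
    # Placeholder recipient; replace with the real age public key.
    age: "age1-placeholder-replace-with-real-recipient"
```

Restricting encryption to `data` and `stringData` keeps names, namespaces, and labels diffable in a pull request while the values stay opaque.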
What I’d do differently
- Define sync windows earlier. Self-heal is great until it fights a 2am hotfix. Agree on windows before, not after.
- Invest in notifications on day one. A failed sync that nobody sees is worse than the manual process you replaced.
- Treat the config repo’s CODEOWNERS as production access control. It is.
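The first two points can be expressed declaratively. The sketch below assumes a per-environment AppProject and a Slack service already configured for Argo CD notifications; the project name, schedule, channel, and repository URL are hypothetical, and the `on-sync-failed` trigger must exist in your notifications configuration.

```yaml
# Sync window: block automated syncs overnight while still allowing manual ones.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: agritech-prod               # hypothetical project
  namespace: argocd
spec:
  sourceRepos:
    - https://git.example.com/platform/config-repo.git  # placeholder URL
  destinations:
    - server: https://kubernetes.default.svc
      namespace: "*"
  syncWindows:
    - kind: deny
      schedule: "0 0 * * *"         # cron: starts at midnight (example value)
      duration: 6h
      applications:
        - "*"
      manualSync: true              # a human can still sync during the window
---
# Notification subscription: metadata fragment of an Application, rest of the spec omitted.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingest-api                  # hypothetical service
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: deploy-alerts  # hypothetical channel
```

Both pieces live in the config repo, so changing a window or a notification target goes through the same review as any other deployment change.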
Three months in, “what’s running in prod?” is answered by a git log instead of a Slack thread. That’s the whole point.