Open-source deployments often succeed technically and fail operationally. We break down the most common failure modes and what production-ready infrastructure actually requires beyond the initial deployment.
Self-hosted deployments fail in a predictable pattern. The initial deployment goes well — the system is running, users are onboarded, the team is satisfied. Then, six to twelve months later, something goes wrong.
The Typical Failure Sequence
An upgrade introduces a breaking change that nobody anticipated because there was no upgrade testing process. A disk fills up because monitoring wasn't configured. A backup fails silently because backup verification was never implemented. A key employee leaves and takes undocumented institutional knowledge with them.
Each of these failures has the same root cause: the deployment was treated as a project with a completion date rather than an operational responsibility with ongoing requirements.
What Production-Ready Actually Means
The term "production-ready" is used loosely. In practice, a production deployment has several distinct requirements beyond working software:
Monitoring and alerting: The system needs to report its own health — resource utilization, error rates, response times, and service availability — to an external monitoring system. Alerts need to go somewhere actionable when thresholds are crossed.
Backup with verification: Data backups need to run on a schedule, write to a separate environment from the primary system, and be verified regularly. An untested backup is not a backup.
Upgrade management: Open-source projects release security patches and version updates continuously. A production system needs a defined process for evaluating, testing, and applying these updates on a schedule that doesn't leave the organization exposed to known vulnerabilities.
Runbook documentation: Someone needs to be able to respond to an incident at 2 a.m. without relying on tribal knowledge. This means documented procedures for common failure modes, escalation paths, and recovery steps.
Access management: Production credentials need to be managed in a secrets store, not in configuration files, chat history, or a single person's memory.
Incident response: There needs to be a defined process for what happens when something breaks — who gets alerted, how severity is assessed, what the recovery steps are, and how the incident gets documented.
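To make the backup requirement above concrete, here is a minimal sketch of one verification step: comparing a checksum of the source file against its backup copy. The function names (`sha256_of`, `verify_backup`) are hypothetical and chosen for illustration. This assumes a simple file-level backup; a matching checksum only shows the copy is intact, so it complements a periodic test restore and does not replace one.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source: Path, backup: Path) -> bool:
    """Return True only if the backup exists, is non-empty, and matches the source.

    Hypothetical helper: assumes the backup is a byte-for-byte copy of the source.
    """
    if not backup.is_file() or backup.stat().st_size == 0:
        # A missing or empty backup is exactly the "silent failure" case.
        return False
    return sha256_of(source) == sha256_of(backup)
```

A scheduled job could call `verify_backup` after each backup run and raise an alert when it returns `False`, so that a failed backup is noticed the same day, not during a recovery.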
The Handoff Gap
Most self-hosted deployment failures happen in what we call the handoff gap: the space between a functional deployment and an operationally sustainable system.
Agencies and consultants often deliver into this gap — a working system with minimal operational documentation, no monitoring, and the assumption that the client's team will figure out operations. Internal projects often fall into it too, with the developer who built the system moving on to other work before operations processes are established.
The result is a system that runs fine until it doesn't, with no early warning and no defined response.
Closing the Gap
Closing the handoff gap requires treating operations as a deliverable, not an afterthought. This means:
- Monitoring and alerting configured before the system goes live
- Backup procedures implemented and verified during deployment
- Upgrade processes documented and tested on a staging environment
- Runbooks written for the failure modes most likely to occur
- Access management configured with offboarding in mind from day one
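As a rough illustration of the first item, a pre-go-live monitoring check can be as small as a disk usage threshold. The sketch below is an assumption-laden example, not a recommended monitoring stack: the function names and the 80% and 90% thresholds are hypothetical, and a real deployment would send the result to an external monitoring system.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def classify(percent_used: float, warn_at: float = 80.0, critical_at: float = 90.0) -> str:
    """Map a usage percentage to a severity label.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"disk usage: {percent:.1f}% ({classify(percent)})")
```

The point of the split between measuring and classifying is that the thresholds can be reviewed and tested on their own, before the system goes live, without needing a nearly full disk to exercise the alert path.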
At TrySelfHost, ongoing operations are part of the engagement from the start. We don't complete a deployment and move on — we assume operational responsibility for the systems we build, which means the handoff gap doesn't exist. The same team that deployed the system is accountable for running it.
This is, in our view, the only responsible way to deliver self-hosted infrastructure to organizations that don't have dedicated operations staff.
Related: how we close the handoff gap
Our Ongoing Infrastructure Operations engagement is specifically designed to prevent the failure pattern described here. The Production-Ready Deployment standard sets the baseline every system must meet before we consider the deployment complete.