Best Practices for On-Call Teams
Getting paged at 3 AM isn't anyone's idea of a good time. But someone needs to be there when production goes down, APIs start throwing 500 errors, or a critical service stops responding. That's the reality of modern software operations.
On-call work has become table stakes for tech teams. Users expect services to be available around the clock, and when something breaks, they want it fixed immediately. This creates unique challenges for engineering teams who need to balance rapid incident response with the health and wellbeing of their people.
The problem? Most organizations approach on-call as an afterthought. They throw together a rotation schedule, distribute some pager credentials, and hope for the best. Then they wonder why engineers are burning out, incidents take forever to resolve, and the whole system feels like it's held together with duct tape.
There's a better way. Building an effective on-call practice requires intentional planning, clear processes, and a culture that values both reliability and human sustainability. This article covers the core practices that separate functional on-call teams from dysfunctional ones.
Table of contents
- Why on-call matters for modern teams
- Technical setup and infrastructure
- Designing sustainable rotation schedules
- Setting clear responsibilities and boundaries
- Building a supportive on-call culture
- Managing alerts and reducing noise
- Handoff procedures and documentation
- Time management and workload balance
- Preventing burnout in on-call teams
- Continuous improvement through postmortems
- Compensation and recognition
- Monitoring and tracking on-call health
Why on-call matters for modern teams
The shift to 24/7 operations didn't happen overnight. It crept up on the industry as SaaS products replaced on-premises software and global user bases demanded constant availability. An outage at 2 AM Pacific time hits users in Europe during their workday.
Companies that get on-call right see measurable benefits. Faster incident response means less downtime and happier customers. Engineers who own their code in production write better code. They think twice before shipping that questionable database migration at 4 PM on Friday.
But the inverse is also true. Poor on-call practices lead to high turnover, degraded system reliability, and a culture where nobody wants to take ownership. Engineers start viewing on-call as punishment rather than a normal part of operations work.
The key insight: on-call is both a technical problem and a people problem. You need solid infrastructure and tooling, but you also need processes that respect human limits and create psychological safety.
Technical setup and infrastructure
Before putting anyone on-call, the technical foundations need to be solid. This means having the right equipment, access controls, and alerting platforms in place.
Equipment and connectivity
On-call engineers need reliable ways to receive and respond to alerts. A company-provided laptop is standard, but mobile access matters more. Most incidents get acknowledged from a phone, not a computer.
Mobile internet connectivity is non-negotiable. If someone can only respond over their home internet connection, they're severely limited in where they can be during their on-call shift. A company-sponsored mobile data plan removes that constraint.
Some teams provide dedicated on-call phones that get passed between engineers. This has pros and cons. It creates a clear separation between work and personal devices, but it also means carrying two phones and potentially missing alerts if the on-call phone is in another room.
Access and permissions
Nothing is worse than getting paged about a production issue and then discovering you can't actually fix it because you lack the necessary permissions. On-call engineers need access to:
- Production systems and databases
- Cloud provider consoles (AWS, GCP, Azure)
- Logging and monitoring platforms
- Code repositories and deployment tools
- Communication channels and runbooks
Access should be granted before someone's first on-call shift, not during it. Test this by having new on-call engineers do a dry run where they attempt to access all critical systems.
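As an illustration, here's a minimal sketch of what a scripted dry run could look like. The system names and URLs are placeholders for your own dashboards, consoles, and wikis, and the check only verifies reachability; logging in to each tool still needs to be tested by hand.

```python
"""Dry-run checklist for a new on-call engineer's access.

The systems listed here are placeholders -- swap in your own
dashboards, consoles, and repositories.
"""
import urllib.error
import urllib.request

# Hypothetical systems a responder must be able to reach.
CRITICAL_SYSTEMS = {
    "Monitoring dashboard": "https://monitoring.example.com/health",
    "Logging platform": "https://logs.example.com/health",
    "Status page admin": "https://status.example.com/health",
    "Runbook wiki": "https://wiki.example.com/runbooks",
}

def check_reachable(name: str, url: str, timeout: float = 5.0) -> bool:
    """Return True if the host responds at all (even with an auth error)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # Server answered; logging in still needs a manual check.
    except OSError as exc:
        print(f"  {name}: unreachable ({exc})")
        return False

if __name__ == "__main__":
    gaps = [name for name, url in CRITICAL_SYSTEMS.items()
            if not check_reachable(name, url)]
    if gaps:
        print("Access gaps to fix before the first shift:", ", ".join(gaps))
    else:
        print("All critical systems reachable.")
```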
Alerting platforms
A dedicated on-call platform serves as the central nervous system for incident response. These tools route alerts to the right person, provide escalation paths when someone doesn't acknowledge, and track response metrics.
Key features to look for:
- Multi-channel notifications (push, SMS, phone calls)
- Escalation policies that automatically page secondary responders
- Schedule management with rotation support
- Integration with monitoring and logging tools
- Mobile apps that actually work
The platform should make it easy to see who's on-call at any given time. This transparency prevents the awkward situation where someone pages the entire team because they can't figure out who's supposed to be responding.
Designing sustainable rotation schedules
Schedule design has an outsized impact on team health. Get it wrong and people burn out. Get it right and on-call becomes manageable.
Team size considerations
Single-person teams have no choice but to be on-call 24/7 unless they bring in external help. This is unsustainable for any meaningful length of time.
Two-person teams can alternate, but this still means being on-call roughly 50% of the time. That's a heavy burden.
A team of three or more enables weekly rotations where engineers get extended breaks between shifts. This is the minimum viable team size for sustainable on-call.
Follow-the-sun vs 24/7 coverage
If your team spans multiple time zones, a follow-the-sun model makes sense. Engineers in Asia-Pacific cover their daytime hours, then hand off to Europe, then to the Americas. Nobody gets woken up at night.
This requires coordination between geographically distributed teams and clear handoff procedures. But when feasible, it's the most humane approach.
For teams in a single location, someone has to take the night shift. Weekly rotations spread this burden relatively evenly. Some teams try alternating nights, but switching between day and night shifts every few days is particularly disruptive to sleep patterns.
Rotation length
The most common rotation lengths are:
| Rotation Length | Pros | Cons |
|---|---|---|
| Daily | Minimizes disruption to any one person | High cognitive load from frequent context switching |
| Weekly | Balances burden and predictability | Full week of interrupted sleep |
| Bi-weekly | Longer stretches of uninterrupted time | Two weeks is a long time to be on-call |
Weekly rotations win for most teams. They provide enough continuity for the on-call engineer to build context about current issues while not dragging on forever.
Some teams do one week on, one week off. Others do one week on, two weeks off if they have enough people. The specific pattern matters less than consistency and fairness.
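To make the mechanics concrete, here's a minimal sketch of a weekly rotation, assuming a hypothetical four-person roster and start date. A scheduling platform normally handles this, but the underlying logic is just modular arithmetic on weeks elapsed.

```python
"""Minimal weekly-rotation sketch: who is primary on call for a given date?

Engineer names and the rotation start date are placeholders.
"""
from datetime import date, timedelta

ENGINEERS = ["amara", "bo", "chen", "dana"]   # hypothetical roster
ROTATION_START = date(2024, 1, 1)             # a Monday; shifts run Monday to Monday
SHIFT_LENGTH = timedelta(weeks=1)

def on_call_for(day: date) -> str:
    """Return the primary on-call engineer for the given calendar day."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]

if __name__ == "__main__":
    today = date(2024, 3, 15)
    print(f"{today}: primary on call is {on_call_for(today)}")
    # With a four-person roster, each engineer is on call one week in four.
```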
Backup and escalation tiers
Primary responders need backup coverage. People sleep through notifications. Phones die. Personal emergencies happen.
A typical escalation structure looks like:
- Primary on-call engineer (responds within 5-15 minutes)
- Secondary on-call engineer (paged if primary doesn't acknowledge in 15 minutes)
- Manager or tech lead (paged for critical incidents or if secondary also doesn't respond)
Some organizations add a fourth tier that pages the entire team for true emergencies. Use this sparingly to avoid alert fatigue.
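For illustration, here's a rough sketch of the tiered escalation logic described above, using the acknowledgment windows from the list. In practice a dedicated alerting platform drives this; the code only shows how the tiers compose.

```python
"""Sketch of a tiered escalation policy, mirroring the structure above.

Timings and tier labels are illustrative, not a required configuration.
"""
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    wait_minutes: int   # how long to wait for acknowledgement before escalating

ESCALATION_POLICY = [
    Tier("primary on-call", wait_minutes=15),
    Tier("secondary on-call", wait_minutes=15),
    Tier("manager / tech lead", wait_minutes=10),
]

def escalation_target(minutes_unacknowledged: int) -> str:
    """Return which tier should currently be paged for an unacknowledged alert."""
    elapsed = 0
    for tier in ESCALATION_POLICY:
        elapsed += tier.wait_minutes
        if minutes_unacknowledged < elapsed:
            return tier.name
    return "page the whole team (last resort)"

if __name__ == "__main__":
    for minutes in (5, 20, 45):
        print(f"{minutes} min unacknowledged -> {escalation_target(minutes)}")
```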
Setting clear responsibilities and boundaries
Ambiguity about what's expected during an on-call shift creates stress and conflict. Teams need explicit agreements about responsibilities.
What on-call engineers should handle
On-call responsibilities typically include:
- Acknowledging and triaging incoming alerts
- Investigating and resolving incidents within their scope
- Escalating issues that require specialized expertise
- Communicating status updates to stakeholders
- Documenting actions taken during incidents
The goal is to restore service, not to implement perfect fixes. If the database connection pool is exhausted, the on-call engineer might restart services to get things working again, then hand off the root cause investigation to the team that owns that service.
What shouldn't be on-call responsibilities
On-call shifts aren't the time for:
- Feature development work
- Proactive refactoring or optimization
- Responding to non-urgent requests
- Training or learning new systems (except as needed for incidents)
Some teams make the mistake of treating on-call engineers as general-purpose resources for any random task. This erodes the value of the on-call role and burns people out.
Response time expectations
Different incident severities warrant different response times. A suggested framework:
- Critical (P1): Service completely down, acknowledge within 5 minutes
- High (P2): Major functionality impaired, acknowledge within 15 minutes
- Medium (P3): Minor functionality affected, acknowledge within 30 minutes
- Low (P4): No user impact, can wait until business hours
Document these expectations explicitly. Engineers shouldn't have to guess whether they need to drop everything immediately or if they can finish eating dinner.
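If you want these targets to live somewhere other than a wiki page, a sketch like the following keeps them as data that tooling can consume. The severity labels and minute values mirror the list above; treating P4 as "business hours only" is an assumption about how you'd encode it.

```python
"""Severity-to-acknowledgement targets from the framework above, kept as data."""

ACK_TARGETS_MINUTES = {
    "P1": 5,     # service completely down
    "P2": 15,    # major functionality impaired
    "P3": 30,    # minor functionality affected
    "P4": None,  # no user impact -- wait for business hours
}

def ack_expectation(severity: str) -> str:
    minutes = ACK_TARGETS_MINUTES.get(severity)
    if minutes is None:
        return f"{severity}: acknowledge during business hours"
    return f"{severity}: acknowledge within {minutes} minutes"

if __name__ == "__main__":
    for severity in ("P1", "P2", "P3", "P4"):
        print(ack_expectation(severity))
```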
Personal time boundaries
On-call doesn't mean being glued to a laptop. Engineers should be able to:
- Go to the gym (with their phone)
- Go to dinner or movies (in areas with cell service)
- Run errands and handle personal tasks
- Sleep (with phone volume turned up)
What they can't do is go completely off-grid. No backcountry camping trips during an on-call week. International travel gets complicated by time zones and roaming issues.
Some teams allow shift swapping for special occasions. If someone has concert tickets or a family event during their on-call week, they can trade shifts with a teammate. This flexibility prevents on-call from feeling like a prison sentence.
Building a supportive on-call culture
Culture determines whether on-call is a shared burden or a source of constant frustration.
Psychological safety and blameless postmortems
When things go wrong, teams need to focus on fixing systems, not blaming people. A blameless approach treats incidents as learning opportunities.
This doesn't mean accountability goes out the window. It means recognizing that human error is a symptom of systemic issues. If someone deployed code that caused an outage, the question isn't "why did they screw up?" but rather "why did our process allow broken code to reach production?"
Engineers need to feel safe escalating and asking for help. A culture of blame creates an environment where people try to hide problems or fix things alone rather than pulling in expertise.
Onboarding and training
Nobody should go on-call without proper preparation. A good onboarding program includes:
- Shadow shifts where new engineers observe experienced responders
- Hands-on exercises simulating common incident types
- Walkthrough of all critical systems and dependencies
- Review of runbooks and escalation procedures
- Test pages to verify alerting works
The investment in onboarding pays off when that person handles their first real incident without panicking.
Team participation and ownership
On-call works best when it's evenly distributed across everyone who has the skills to respond. Some teams exempt senior engineers or managers, but this creates resentment.
If someone's too senior to carry a pager, they're too senior to complain about reliability issues. Managers who participate in on-call stay connected to operational reality and make better decisions about prioritization and staffing.
That said, people need adequate skills before going on-call. Junior engineers might shadow for months before taking primary shifts. That's appropriate if it matches their readiness.
Managing alerts and reducing noise
Alert fatigue kills on-call effectiveness. When engineers get paged constantly for non-issues, they start ignoring alerts or acknowledging them without actually investigating.
Alert hygiene
Every alert should be:
- Actionable: The person receiving it can do something about it
- Urgent: It requires immediate attention, not eventual follow-up
- Real: It indicates an actual problem, not a false positive
Alerts that don't meet these criteria should be downgraded to metrics that get checked during business hours or removed entirely.
Common sources of alert noise
Watch out for:
- Flapping alerts: Systems that oscillate between healthy and unhealthy states
- Dependency failures: Alerts about downstream services that you don't control
- Threshold tuning: Alerts triggered by normal usage patterns or traffic spikes
- Test environment noise: Development or staging alerts routed to production on-call
Fixing these requires dedicated time for alert maintenance. Some teams schedule periodic "alert review" sessions where they analyze which alerts fired recently and whether they were valuable.
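A small script can seed those review sessions. The sketch below assumes a hypothetical export of recent alerts with a "was it actionable" flag, then ranks alerts by how often they fired and how rarely they mattered.

```python
"""Alert-review sketch: which alerts fire most, and how often were they actionable?

The `recent_alerts` records are hypothetical; in practice you would export
this history from your alerting platform.
"""
from collections import defaultdict

recent_alerts = [
    {"name": "HighCPU", "actionable": False},
    {"name": "HighCPU", "actionable": False},
    {"name": "DiskFull", "actionable": True},
    {"name": "HighCPU", "actionable": False},
    {"name": "5xxSpike", "actionable": True},
]

counts = defaultdict(lambda: {"fired": 0, "actionable": 0})
for alert in recent_alerts:
    counts[alert["name"]]["fired"] += 1
    counts[alert["name"]]["actionable"] += int(alert["actionable"])

for name, c in sorted(counts.items(), key=lambda kv: kv[1]["fired"], reverse=True):
    rate = c["actionable"] / c["fired"]
    flag = "  <- candidate for tuning or removal" if rate < 0.5 else ""
    print(f"{name}: fired {c['fired']}x, actionable {rate:.0%}{flag}")
```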
Intelligent alerting strategies
Better alerting practices include:
- Aggregation: Group related alerts instead of firing dozens individually
- Suppression: Automatically silence alerts for known maintenance windows
- Smart routing: Send database alerts to database experts, not the general on-call rotation
- Severity escalation: Start with low-priority notifications and escalate if conditions worsen
The goal is to page people only when their immediate action matters.
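Here's a rough sketch of two of those strategies, aggregation and maintenance-window suppression, under assumed field names and an assumed window. Real alerting platforms offer these as built-in features; the code only illustrates the idea.

```python
"""Sketch: group related alerts into one page and suppress alerts
that arrive during a known maintenance window.

Field names and the maintenance window are assumptions for illustration.
"""
from datetime import datetime, timezone
from itertools import groupby

# Hypothetical maintenance window (UTC) during which paging is suppressed.
MAINTENANCE = (
    datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 6, 1, 4, 0, tzinfo=timezone.utc),
)

def in_maintenance(ts: datetime) -> bool:
    start, end = MAINTENANCE
    return start <= ts <= end

def pages_to_send(alerts):
    """Collapse alerts that share a service into one page, dropping suppressed ones."""
    active = [a for a in alerts if not in_maintenance(a["ts"])]
    active.sort(key=lambda a: a["service"])
    pages = []
    for service, group in groupby(active, key=lambda a: a["service"]):
        related = list(group)
        pages.append(f"{service}: {len(related)} related alert(s) -> one page")
    return pages

if __name__ == "__main__":
    alerts = [
        {"service": "checkout", "ts": datetime(2024, 6, 1, 9, 5, tzinfo=timezone.utc)},
        {"service": "checkout", "ts": datetime(2024, 6, 1, 9, 6, tzinfo=timezone.utc)},
        {"service": "search", "ts": datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc)},  # suppressed
    ]
    print("\n".join(pages_to_send(alerts)))
```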
Handoff procedures and documentation
Smooth handoffs prevent dropped context and repeated work.
What to include in handoffs
At the end of each shift, the outgoing engineer should document:
- Open incidents and their current status
- Ongoing issues that aren't incidents but need monitoring
- Actions taken and their results
- Escalations or external communications in flight
- Any system changes or deployments during the shift
This can be a simple doc, a Slack message, or a formal handoff meeting. The format matters less than consistency.
Runbooks and playbooks
Runbooks document how to respond to specific scenarios. They turn tribal knowledge into shared resources.
A good runbook includes:
- Clear description of the problem symptoms
- Step-by-step debugging procedures
- Known resolution steps
- When to escalate and to whom
- Links to relevant logs, dashboards, or documentation
Runbooks should be living documents that get updated after incidents. If the on-call engineer had to figure something out, that knowledge should be captured for next time.
Communication protocols
Who needs to be informed about incidents and when? Define this upfront.
- Customer-facing outages need public status page updates
- Internal service degradation might just need a Slack post
- Extended incidents need regular updates even if nothing has changed
Templates help here. Having a pre-written incident notification template means the on-call engineer can quickly fill in details and send updates without composing messages from scratch under stress.
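A template can be as simple as a format string. The fields and wording below are illustrative, not a required structure.

```python
"""Sketch of a pre-written status update an on-call engineer could fill in quickly."""
from string import Template

STATUS_UPDATE = Template(
    "[$severity] $service incident - update #$update_number\n"
    "Impact: $impact\n"
    "Current status: $status\n"
    "Next update by: $next_update"
)

if __name__ == "__main__":
    print(STATUS_UPDATE.substitute(
        severity="P1",
        service="Checkout API",
        update_number=2,
        impact="Payments failing for a subset of users",
        status="Rolling back the most recent deploy",
        next_update="15:30 UTC",
    ))
```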
Time management and workload balance
On-call time isn't free time. It needs to be accounted for in capacity planning.
Dedicated on-call shifts vs split focus
Some teams treat on-call as a separate role where the engineer doesn't work on normal projects. This makes sense for high-interrupt environments where pages are frequent.
Other teams expect engineers to do regular work while on-call, with the understanding that interruptions will happen. This works better for lower-alert-volume scenarios.
The worst approach is expecting full productivity on both fronts. Engineers can't maintain focus on complex work while waiting for potential pages.
Interrupt-driven work challenges
Context switching has real cognitive costs. Getting paged in the middle of writing code means losing that mental state. It takes time to get back into flow.
For high-interrupt on-call periods, assign work that's naturally chunked. Bug fixes, code reviews, and documentation tasks handle interruptions better than architecting new features.
Post-on-call recovery time
After a tough on-call week, especially one with night pages, engineers need recovery time. Some organizations provide:
- Protected time the day after a night page (no meetings, flexible hours)
- Extra PTO earned for particularly brutal on-call periods
- Rotation to low-interrupt work immediately after on-call
Burning people out on-call and then expecting immediate full productivity is shortsighted.
Preventing burnout in on-call teams
Burnout sneaks up gradually, then hits all at once.
Warning signs
Watch for:
- Increased cynicism or negativity about on-call
- Declining response times or engagement
- More sick days, especially around on-call weeks
- Degraded quality of incident responses
- Engineers actively avoiding or complaining about on-call
These are symptoms of systemic problems, not individual weakness.
Monitoring team health metrics
Track data about on-call load:
- Number of pages per shift
- Time to acknowledge alerts
- Incident resolution times
- After-hours pages vs business hours
- Distribution of alert load across team members
If one person consistently gets more pages, something's wrong with either the rotation or alert routing. If everyone's getting paged multiple times per night, the system needs fixing, not tougher engineers.
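A quick sketch of how those load metrics might be computed from exported page events. The records, the roster, and the 09:00-17:59 definition of business hours are all assumptions.

```python
"""Sketch of on-call load metrics computed from a list of page events.

The event records are hypothetical; export real ones from your alerting tool.
"""
from collections import Counter

# Each page: who was paged and the local hour it arrived.
pages = [
    {"engineer": "amara", "hour": 3},
    {"engineer": "amara", "hour": 14},
    {"engineer": "bo", "hour": 23},
    {"engineer": "amara", "hour": 2},
]

BUSINESS_HOURS = range(9, 18)   # assumption: 09:00-17:59 counts as business hours

per_engineer = Counter(p["engineer"] for p in pages)
after_hours = sum(1 for p in pages if p["hour"] not in BUSINESS_HOURS)

print("Pages per engineer:", dict(per_engineer))
print(f"After-hours pages: {after_hours}/{len(pages)} ({after_hours / len(pages):.0%})")
# A skewed per-engineer count or a high after-hours share is a signal to
# rebalance the rotation or fix the underlying alerts.
```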
Systemic improvements
The best way to prevent burnout is reducing toil and improving reliability. This means:
- Automating common incident responses
- Fixing recurring issues permanently instead of band-aiding them
- Improving observability to speed up debugging
- Simplifying complex systems that generate operational burden
This requires organizational commitment. If teams never get time to work on reliability improvements because they're always building features, on-call stays painful.
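As one example of automating a common response, here's a sketch of a watcher that restarts a service after repeated failed health checks. Both `check_health` and `restart_service` are placeholders for whatever your environment actually uses (an HTTP probe, systemctl, a Kubernetes rollout, and so on).

```python
"""Sketch: auto-remediate one recurring failure mode by restarting a service
after several consecutive failed health checks.
"""
import time

FAILURE_THRESHOLD = 3          # consecutive failures before acting
CHECK_INTERVAL_SECONDS = 30

def check_health() -> bool:
    """Placeholder health probe; replace with a real check."""
    return True

def restart_service() -> None:
    """Placeholder remediation; replace with a real restart command."""
    print("Restarting service and notifying the on-call channel...")

def watch() -> None:
    """Run forever, restarting the service when the failure threshold is hit."""
    failures = 0
    while True:
        if check_health():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                restart_service()
                failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)
```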
Continuous improvement through postmortems
Every significant incident is a learning opportunity.
When to write postmortems
Not every blip needs a full postmortem. Focus on:
- Customer-impacting outages
- Novel failure modes
- Incidents that revealed gaps in process or tooling
- Close calls that almost became major incidents
For minor issues, a brief summary in the handoff doc suffices.
Postmortem structure
Effective postmortems cover:
- Timeline of events
- Root cause analysis
- Contributing factors
- What went well in the response
- What could be improved
- Action items with owners and deadlines
The blameless principle applies here. Focus on systems and processes, not individuals.
Following through on action items
Postmortems are worthless if action items never get done. Track them like any other project work. Hold teams accountable for completing remediation tasks.
Some organizations have a rule that action items from postmortems get priority over new features until resolved. This prevents accumulating technical debt from operational issues.
Compensation and recognition
On-call has real costs. Engineers give up flexibility and take on stress. This deserves acknowledgment.
On-call pay structures
Common compensation approaches:
- Stipend: Fixed additional pay for being on-call, regardless of pages
- Per-incident bonus: Payment for each alert responded to
- Overtime pay: Hourly rate for after-hours work
- Time in lieu: Comp time off for on-call hours worked
The specific model depends on local labor laws and company culture. At minimum, on-call shouldn't be an unpaid extra responsibility.
Non-monetary recognition
Money isn't everything. Other forms of recognition matter:
- Public acknowledgment of good incident responses
- Career credit for operational work
- Influence over roadmap priorities to fix painful issues
- Professional development opportunities to build new skills
Teams where on-call work is seen as less valuable than feature work struggle with morale.
Monitoring and tracking on-call health
What gets measured gets managed. Track metrics about on-call effectiveness and team health.
Key metrics to monitor
Important indicators include:
| Metric | What It Shows | Target |
|---|---|---|
| Time to acknowledge | How quickly engineers respond | < 15 minutes for critical |
| Time to resolution | How quickly issues get fixed | Depends on severity |
| Pages per shift | Alert volume and system health | Trending down over time |
| False positive rate | Alert quality | < 10% |
| After-hours pages | Sleep disruption | Minimized where possible |
These metrics help identify problems before they become crises.
Regular retrospectives
Beyond postmortems for specific incidents, hold periodic retrospectives on the on-call process itself. Ask:
- What's working well?
- What's frustrating?
- Where are the gaps in documentation or training?
- What improvements would have the biggest impact?
Create a safe space for honest feedback. Anonymous surveys help if people aren't comfortable speaking up in meetings.
Continuous refinement
On-call practices shouldn't be static. As systems evolve, teams grow, and operational patterns change, the on-call process needs adjustment.
Small, frequent improvements beat big overhauls. Fix one annoying alert this week. Add a runbook next week. Adjust the rotation schedule the week after.
Modern software operations depend on responsive, sustainable on-call practices. The teams that do this well balance system reliability with human wellbeing. They invest in tooling and automation. They create cultures where asking for help is encouraged and learning from failures is expected.
Getting there takes intentional effort. But the payoff is substantial: happier engineers, more reliable systems, and better outcomes for users.
For teams looking to level up their on-call game, tools like Odown provide the infrastructure needed for effective monitoring. With uptime monitoring for websites and APIs, SSL certificate tracking to prevent expiration surprises, and public status pages for transparent communication, Odown handles the technical foundation so teams can focus on building great on-call practices.



