Server Health Checks for Application Reliability
In the software development world, server health checks are often overlooked until something breaks. I've seen many companies scramble to implement monitoring after experiencing a catastrophic outage—when customers are already angry and revenue has been lost. Let's not make that mistake.
Server health checks are crucial monitoring mechanisms that verify if your servers, applications, and services are functioning correctly. They're the digital equivalent of a doctor's checkup for your infrastructure.
Table of Contents
- What Are Server Health Checks?
- Why Server Health Checks Matter
- Types of Server Health Checks
- Key Metrics to Monitor
- Implementing Effective Health Checks
- Common Health Check Endpoints
- Health Check Best Practices
- Health Checks in Load Balancing
- Health Checks in Container Environments
- Debugging Failed Health Checks
- Health Check Tools and Services
- Integrating Health Checks with Monitoring Systems
- Building Resilient Systems with Health Checks
- Conclusion
What Are Server Health Checks?
Server health checks are automated tests that continuously verify the operational status of your servers, applications, and related services. They're simple in concept but powerful in practice.
A health check typically works by sending a request to a specific endpoint on your server or application and evaluating the response. If the response matches expected criteria, the service is considered healthy. If not, it's flagged as unhealthy, potentially triggering alerts or automatic remediation.
The beauty of health checks lies in their simplicity. While they can be sophisticated, even a basic ping test that confirms server reachability provides valuable information about your system's state.
Why Server Health Checks Matter
I once worked with a company that lost over $50,000 in revenue due to an undetected server issue that lasted just four hours. Their application had silently failed, but because they had no health checks in place, nobody noticed until customers started complaining.
Health checks help prevent such situations by:
- Detecting failures early - Identify issues before they impact users
- Enabling automatic recovery - Trigger restarts or failovers when problems occur
- Supporting scaling operations - Inform load balancers about which instances can receive traffic
- Providing visibility - Generate data about system health over time
- Improving reliability - Help maintain high availability through early intervention
For critical applications, health checks aren't optional—they're essential infrastructure components that help maintain service quality and reliability.
Types of Server Health Checks
Let's explore the different types of health checks you might implement:
Basic Connectivity Checks
These verify that a server is reachable and responsive:
- Ping checks - Use ICMP to verify that a server responds to network requests
- Port checks - Verify that specific TCP/UDP ports are open and accepting connections
- DNS checks - Confirm that DNS resolution works correctly for your domain
Basic checks tell you if your server is accessible, but not necessarily if your application is functioning properly.
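A basic port check can be sketched in a few lines of Python; the host and port below are placeholders for whatever service you want to verify:

```python
import socket

def check_port(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

For example, `check_port("db.internal", 5432)` (a hypothetical database host) tells you the port is accepting connections, though not whether the database behind it is healthy.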
Application-Level Checks
These dig deeper to verify application functionality:
- HTTP checks - Verify that web servers return expected status codes
- API endpoint checks - Test that API endpoints return valid responses
- Database connection checks - Verify database connectivity and basic query functionality
- Authentication checks - Ensure authentication systems are working
Application checks provide more meaningful information about service health than basic connectivity checks.
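An HTTP status check is a small step up from a port check. Here is a minimal sketch using only the Python standard library; a production check would usually also validate the response body:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def check_http(url: str, expected_status: int = 200, timeout: float = 5.0) -> bool:
    """Return True if the URL responds with the expected HTTP status code."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == expected_status
    except HTTPError as err:  # 4xx/5xx responses arrive as exceptions
        return err.code == expected_status
    except URLError:          # connection-level failure
        return False
```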
Synthetic Transactions
These simulate user behavior to verify end-to-end functionality:
- User flow checks - Automate common user paths through your application
- Form submission checks - Test that forms accept and process input correctly
- Payment processing checks - Verify that payment systems work (using test transactions)
Synthetic transactions are the most comprehensive health checks, but also the most complex to implement.
Dependency Checks
These verify that external services your application depends on are available:
- Third-party API checks - Confirm that external APIs respond correctly
- CDN checks - Verify that content delivery networks are serving assets
- Payment gateway checks - Ensure payment processors are operational
Dependency checks help distinguish between failures in your systems versus failures in external services.
Key Metrics to Monitor
Health checks should assess various aspects of server performance, including:
System-Level Metrics
Metric | Description | Typical Threshold |
---|---|---|
CPU Usage | Percentage of processor capacity used | 70-80% |
Memory Utilization | Percentage of RAM in use | 70-80% |
Disk Space | Percentage of storage capacity in use | 80-90%
Disk I/O | Speed of read/write operations | Varies by hardware |
Network Throughput | Data transfer rate | Varies by network |
Application-Level Metrics
Metric | Description | Typical Threshold |
---|---|---|
Response Time | Time to process requests | 100-300ms |
Error Rate | Percentage of requests resulting in errors | 0.1-1% |
Request Rate | Number of requests per second | Varies by application |
Active Connections | Number of concurrent connections | Varies by application |
Queue Depth | Pending requests or jobs | Should trend toward zero |
Service-Specific Metrics
For databases:
- Query execution time
- Connection pool utilization
- Lock contention
- Index performance
For web servers:
- Time to first byte
- SSL/TLS handshake time
- Cache hit ratio
- Thread pool utilization
For message queues:
- Queue depth and growth rate
- Message processing time
- Dead letter queue size
- Consumer lag
The specific metrics you monitor will depend on your application architecture and business requirements. Start with the basics and expand as needed.
Implementing Effective Health Checks
Creating effective health checks requires balancing thoroughness with performance impact. Here's how to implement them properly:
1. Define Health Check Endpoints
For web applications and APIs, create dedicated endpoints that perform appropriate checks:
```
GET /health
GET /health/ready
GET /health/live
```

Each endpoint can serve a different purpose:

- /health - Overall application health
- /health/ready - Readiness for traffic (for load balancers)
- /health/live - Basic liveness check (for container orchestrators)
2. Choose Appropriate Response Formats
Health checks should return clear, parseable responses. JSON is common:
```json
{
  "version": "1.2.3",
  "checks": [
    {
      "status": "healthy",
      "time": 15
    },
    {
      "status": "healthy",
      "time": 2
    }
  ],
  "timestamp": "2025-05-20T15:04:05Z"
}
```
Include relevant details but avoid exposing sensitive information.
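The response format above can be served by a small aggregating endpoint. The sketch below uses only the Python standard library; the check names passed in (e.g. "database", "cache") are whatever component checks your application defines:

```python
import json
import http.server

def make_health_server(checks, port=0):
    """Build an HTTP server whose /health endpoint runs each named check
    callable and reports aggregate status: 200 if all pass, 503 otherwise."""
    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            results = {name: bool(fn()) for name, fn in checks.items()}
            healthy = all(results.values())
            body = json.dumps({
                "status": "healthy" if healthy else "unhealthy",
                "checks": results,
            }).encode()
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep health-check traffic out of the logs
            pass

    return http.server.HTTPServer(("127.0.0.1", port), Handler)
```

Usage might look like `make_health_server({"database": ping_db}).serve_forever()`, where `ping_db` is a hypothetical check callable returning True or False.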
3. Set Appropriate Check Frequency
Balance frequency against performance impact:
- Critical services: 10-30 second intervals
- Standard services: 30-60 second intervals
- Non-critical services: 1-5 minute intervals
Remember that very frequent checks can themselves impact performance.
4. Implement Timeout and Retry Logic
Health checks should fail fast:
- Set short timeouts (1-5 seconds typically)
- Use appropriate retry logic for transient issues
- Avoid cascading failures by degrading check frequency when systems are under load
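The retry logic above can be sketched as a small wrapper around any check callable; the retry count and delay here are illustrative defaults:

```python
import time

def check_with_retry(check, retries=3, delay=0.5):
    """Run a health-check callable that returns True/False, retrying
    transient failures with a short pause between attempts."""
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:  # don't sleep after the final attempt
            time.sleep(delay)
    return False
```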
5. Consider Authentication for Health Checks
For public-facing applications, decide whether health check endpoints need protection:
- Public basic health endpoints may be acceptable
- Detailed health information should be protected
- Consider IP restrictions, simple tokens, or basic auth
The goal is to balance security with simplicity—health checks should work even when other systems fail.
Common Health Check Endpoints
Different frameworks and platforms have adopted conventions for health check endpoints:
Spring Boot Applications
Spring Boot's Actuator module provides several endpoints:
- /actuator/health - Overall health information
- /actuator/info - Application information
- /actuator/metrics - Detailed metrics
Example response:
```json
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "validationQuery": "isValid()"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "free": 219662336000
      }
    }
  }
}
```
Node.js Applications
For Express applications, a simple health check might look like:
```javascript
app.get('/health', (req, res) => {
  const healthcheck = {
    message: 'OK',
    timestamp: Date.now()
  };
  res.status(200).json(healthcheck);
});
```
Kubernetes Readiness/Liveness Probes
Kubernetes uses distinct probe types:
- Liveness probes - Determine if a container needs to be restarted
- Readiness probes - Determine if a container can receive traffic
- Startup probes - Determine if an application has started successfully
For example, a readiness probe:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
```
Health Check Best Practices
After implementing countless health checks across different systems, I've learned several valuable lessons:
1. Keep health checks lightweight - They should execute quickly and consume minimal resources.
2. Make health checks meaningful - They should verify actual functionality, not just that a process is running.
3. Include dependency checks - Verify that required external services are available.
4. Implement circuit breakers - Don't let dependency failures cascade to your service's health.
5. Use appropriate status codes - For HTTP checks, use standard status codes:
   - 200 OK: Fully healthy
   - 503 Service Unavailable: Not ready or unhealthy
   - 500 Internal Server Error: Check itself failed
6. Include version information - Health checks are an excellent place to report version details.
7. Avoid side effects - Health checks should be read-only and not modify system state.
8. Log check failures - But implement rate limiting to prevent log flooding.
9. Test failure scenarios - Verify that health checks correctly report unhealthy states.
10. Document health check endpoints - Include them in your service documentation.
Health Checks in Load Balancing
Load balancers use health checks to determine where to send traffic. This is critical for maintaining high availability.
How Load Balancer Health Checks Work
- The load balancer periodically sends requests to each backend server
- If a server responds appropriately, it remains in the pool
- If a server fails to respond or returns an error, it's removed from the pool
- Once removed, the server is checked at intervals until it recovers
- When health is restored, the server is added back to the pool
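The add/remove cycle above hinges on consecutive-result thresholds, which can be sketched as a small state tracker (a simplified model, not any particular load balancer's implementation):

```python
class BackendHealth:
    """Track one backend's pool membership from consecutive health-check
    results, using unhealthy/healthy thresholds as described above."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_pool = True
        self._streak = 0  # consecutive results counting toward a state change

    def record(self, passed: bool) -> bool:
        """Record one check result; return whether the backend is in the pool."""
        if self.in_pool:
            self._streak = self._streak + 1 if not passed else 0
            if self._streak >= self.unhealthy_threshold:
                self.in_pool, self._streak = False, 0
        else:
            self._streak = self._streak + 1 if passed else 0
            if self._streak >= self.healthy_threshold:
                self.in_pool, self._streak = True, 0
        return self.in_pool
```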
Common Load Balancer Health Check Settings
Setting | Description | Typical Value |
---|---|---|
Path | Endpoint to check | /health or /ping |
Interval | Time between checks | 5-30 seconds |
Timeout | Maximum response time | 2-5 seconds |
Unhealthy Threshold | Failed checks before removal | 2-3 checks |
Healthy Threshold | Successful checks for reinstatement | 2-5 checks |
Health Check Configuration Examples
For AWS Application Load Balancer:
```json
{
  "HealthCheckPort": "80",
  "HealthCheckPath": "/health",
  "HealthCheckIntervalSeconds": 30,
  "HealthCheckTimeoutSeconds": 5,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 2
}
```
For NGINX:
```nginx
server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
```
Health Checks in Container Environments
Container orchestrators like Kubernetes rely heavily on health checks to manage container lifecycles.
Kubernetes Probe Types
Kubernetes uses three distinct probe types, each serving a different purpose:
- Liveness Probes determine if a container is running properly. If a liveness probe fails, Kubernetes restarts the container.
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
- Readiness Probes determine if a container is ready to accept traffic. If a readiness probe fails, Kubernetes stops sending traffic to the container.
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
```
- Startup Probes determine if an application within a container has started. This is particularly useful for slow-starting applications.
```yaml
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
```
Docker Health Checks
Docker also supports built-in health checks:
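A `HEALTHCHECK` instruction in a Dockerfile runs a command inside the container on a schedule. A minimal sketch, assuming the application serves `/health` on port 8080 and `curl` is available in the image:

```dockerfile
# Mark the container unhealthy after 3 consecutive failed checks
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
```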
These health checks integrate with Docker's container management, enabling automatic restarts of unhealthy containers.
Debugging Failed Health Checks
When health checks fail, systematic debugging is essential. Here's a troubleshooting approach:
- Check logs first - Application and server logs often contain error details
- Verify network connectivity - Ensure network paths are open
- Test manually - Try accessing the health check endpoint directly
- Check dependencies - Verify that all required services are available
- Monitor resource utilization - Look for CPU, memory, or disk space issues
- Review recent changes - Code deploys, configuration changes, or infrastructure updates
- Check for maintenance windows - Planned outages of dependent services
Common causes of health check failures include:
- Insufficient resources (CPU, memory, disk)
- Network connectivity issues
- Database connection problems
- Configuration errors
- Application bugs
- Dependency failures
- Certificate expirations
Keep a runbook of common issues and their resolutions to speed up troubleshooting.
Health Check Tools and Services
A variety of tools can help implement and manage health checks:
Open Source Monitoring Tools
- Prometheus with Alertmanager - Metrics collection and alerting
- Grafana - Visualization and dashboarding
- Nagios/Icinga - Traditional server monitoring
- Zabbix - Enterprise-class monitoring
- Healthchecks.io - Simple cron job monitoring
Cloud Provider Solutions
- AWS CloudWatch - Monitoring for AWS resources
- Google Cloud Monitoring - Formerly Stackdriver
- Azure Monitor - Microsoft's monitoring solution
- Datadog - Comprehensive monitoring and analytics
- New Relic - Performance monitoring
Load Balancer Health Checks
- AWS Elastic Load Balancing - Health checks for EC2 instances
- Google Cloud Load Balancing - Health checks for GCP resources
- Azure Load Balancer - Health checks for Azure VMs
- HAProxy - Open-source load balancer with health check capabilities
- NGINX Plus - Commercial NGINX with advanced health checks
Each tool has its strengths and ideal use cases. For complex environments, a combination of tools often provides the most comprehensive coverage.
Integrating Health Checks with Monitoring Systems
Health checks become even more powerful when integrated with broader monitoring systems. Here's how to connect them:
Exposing Health Metrics
Make health check results available as metrics:
```
# TYPE api_health_status gauge
api_health_status 1
# HELP api_health_check_duration_seconds Time taken to execute health check
# TYPE api_health_check_duration_seconds histogram
api_health_check_duration_seconds{component="database"} 0.023
api_health_check_duration_seconds{component="cache"} 0.002
```
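Producing output like the above is straightforward. The sketch below renders results in the Prometheus text exposition format, simplified to gauges; a real exporter would normally use an official client library instead:

```python
def render_health_metrics(status_up, durations):
    """Render health-check results in Prometheus text exposition format.
    A simplified sketch: per-component durations are emitted as gauges."""
    lines = [
        "# TYPE api_health_status gauge",
        f"api_health_status {1 if status_up else 0}",
        "# TYPE api_health_check_duration_seconds gauge",
    ]
    for component, seconds in durations.items():
        lines.append(
            f'api_health_check_duration_seconds{{component="{component}"}} {seconds}'
        )
    return "\n".join(lines) + "\n"
```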
Setting Up Alerts
Configure alerting based on health check results. A Prometheus alerting rule might look like this (the group, alert, and label names are illustrative):

```yaml
groups:
  - name: health-checks
    rules:
      - alert: ServiceHealthCheckFailing
        expr: api_health_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "The service health check has been failing for 5 minutes."
```
Creating Dashboards
Visualize health status across your infrastructure:
- Overall service health
- Health check response times
- Failed check count
- Health check history
A well-designed dashboard provides at-a-glance understanding of system status.
Implementing Escalation Policies
Not all health check failures require immediate action. Set up tiered response based on:
- Service criticality
- Time of day
- Duration of failure
- Impact scope
For instance, a minor service failing outside business hours might only trigger a notification, while a critical service failure during peak hours could trigger a full incident response.
Building Resilient Systems with Health Checks
Health checks aren't just for monitoring—they're fundamental building blocks for resilient systems:
Auto-Remediation
Use health check failures to trigger automatic remediation:
- Restart containers or services
- Scale up resources
- Failover to standby systems
- Flush caches
- Reset connections
For example, a simple bash script might look like this:

```bash
#!/bin/bash
# Restart the service if the health endpoint does not return HTTP 200
response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
if [ "$response" -ne 200 ]; then
  systemctl restart myservice
fi
```
Circuit Breaking
Implement circuit breakers that use health checks to prevent cascading failures:
- When dependent services fail health checks, stop trying to use them
- Return fallback responses or gracefully degrade functionality
- Periodically check if the dependency has recovered
- Resume normal operation when health is restored
Libraries like Hystrix, Resilience4j, or Polly can help implement this pattern.
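The steps above can be sketched as a minimal circuit breaker; the thresholds and timing are illustrative, and a production implementation would also handle half-open probing more carefully:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, fail fast
    while open, and retry once a cooldown period has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()      # circuit open: fail fast
            self.opened_at = None      # cooldown elapsed: half-open, try again
        try:
            result = fn()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
```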
Self-Healing Architecture
Design systems that automatically recover from failures:
- Use container orchestration platforms like Kubernetes
- Implement redundancy at multiple levels
- Design for horizontal scaling
- Use stateless services where possible
- Implement retry logic with exponential backoff
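Retry with exponential backoff, the last item above, can be sketched as a small helper; the attempt count and delays are illustrative defaults, and jitter is added to avoid synchronized retries:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call fn, retrying on exception with exponentially growing, jittered
    delays. Re-raises the last exception if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter
```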
Health checks provide the signals that trigger these self-healing mechanisms.
Conclusion
Server health checks are a critical component of modern application infrastructure. They provide early warning of issues, enable automatic remediation, and support resilient system design. By implementing comprehensive health checks across your stack, you can significantly improve reliability and reduce downtime.
Remember these key principles:
- Keep health checks lightweight but meaningful
- Monitor both system and application-level metrics
- Integrate health checks with your broader monitoring strategy
- Use health check results to drive automatic remediation
- Design your architecture to be self-healing
If you're looking for a robust solution to monitor your website, API, or SSL certificates, Odown provides comprehensive health checking and uptime monitoring. Its features include customizable health checks, detailed performance metrics, and instant alerts when issues arise. Additionally, Odown's public status pages keep your users informed about service health, enhancing transparency and trust.
With proper implementation of server health checks and tools like Odown, you can build resilient systems that recover quickly from failures and provide reliable service to your users.