What Does Managed Cloud Infrastructure Actually Cover?

Farouk Ben. - Founder at Odown

Running production workloads in the cloud means somebody has to keep things running. The question is who.

That's what managed cloud infrastructure answers. When you hear "managed," you're really hearing "we run day-two operations for specific layers of the stack." The provider handles patching, monitoring, backups, and incident response for those layers. You still own the rest.

And that boundary matters more than you think.

Most outages don't happen because of hardware failures or hypervisor bugs. They happen because someone misconfigured IAM permissions, forgot to test a restore procedure, or assumed "managed" meant "we don't have to think about this anymore."

It doesn't.

The value of a managed offering is clarity. You pay to stop doing repetitive operational tasks by hand. The provider takes on defined responsibilities. You take on everything else. But if you can't point to that line in a contract, you're going to spend the first hour of an incident arguing about whose problem it is instead of fixing it.

What managed cloud infrastructure actually means

A managed cloud service is an agreement where a provider operates and maintains specific infrastructure layers on your behalf. This typically includes compute resources, storage systems, networking components, and platform services.

The core value proposition is operational offload. You stop worrying about OS patches on control plane nodes. You stop writing Ansible playbooks to rotate certificates on managed load balancers. You stop setting up Prometheus exporters for infrastructure metrics that the provider already monitors.

But offload doesn't mean abdication.

You still configure everything the provider gives you. You still deploy workloads, manage application dependencies, set up IAM policies, and test disaster recovery procedures. The provider keeps their layer running. You keep your layer running.

Most managed infrastructure follows one of three models: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), or Software-as-a-Service (SaaS). Each model draws the responsibility line in a different place.

IaaS providers manage the physical data center, networking hardware, storage arrays, and hypervisors. You manage everything above that. Virtual machines, operating systems, runtime environments, applications, data.

PaaS providers go further. They manage the OS and runtime. You manage application code, configuration, and data.

SaaS providers manage almost everything. You manage user access and whatever configuration options the application exposes.

The more the provider manages, the less control you have. That trade-off is fine if you don't need control. It's a problem if you do.

The shared responsibility model

Cloud providers describe this split as "shared responsibility." The provider secures and operates the infrastructure. You secure and operate what runs on it.

The physical data center is always the provider's problem. Power distribution, cooling systems, physical access controls, hardware lifecycle management. You never touch any of that.

The virtualization layer is usually the provider's problem. Hypervisor patching, host OS hardening, storage replication, network fabric operation.

Everything above virtualization depends on what you're buying.

If you rent virtual machines, the guest OS is yours. You patch it. You configure it. You harden it. The provider doesn't log into your VMs.

If you buy managed Kubernetes, the control plane is theirs. API server uptime, etcd backups, control plane version upgrades. But the worker nodes might be yours or theirs depending on the service tier. RBAC policies, admission controllers, network policies, and workload deployments are always yours.

If you use managed databases, the database engine is theirs. Replication, failover, backups, version upgrades. But schema design, query optimization, access control, and data retention policies are yours.

The line moves depending on the service. That's why you need a written responsibility matrix.

What providers typically manage

Most managed infrastructure offerings include a similar set of operational tasks. These are the things you're paying to stop doing yourself.

Platform uptime and availability

Providers commit to keeping their layer running. That commitment usually comes with a service level agreement that specifies an uptime percentage and defines what counts as downtime.

A 99.99% uptime SLA allows roughly 52 minutes of downtime per year. A 99.95% SLA allows about 4.4 hours. The difference matters if you're running production services with tight RTO requirements.
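The arithmetic behind those numbers is worth having on hand when comparing tiers. A quick sketch:

```python
def allowed_downtime_minutes(sla_percent: float, period_minutes: int = 525_600) -> float:
    """Minutes of downtime an SLA permits over a period (default: one 365-day year)."""
    return period_minutes * (1 - sla_percent / 100)

# 99.99% over a year permits ~52.6 minutes; 99.95% permits ~262.8 minutes (~4.4 hours).
```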

Uptime SLAs typically apply to the control plane or management API, not to your workloads. If the Kubernetes API server is up but your pods are crashlooping because of a bad deployment, the provider isn't in breach.

Patching and version management

Providers patch and upgrade components they control. Security updates for the hypervisor, bug fixes for managed database engines, version upgrades for control plane services.

But patching scope varies. Some providers automatically patch worker node operating systems. Others make you opt in. Some handle application runtime updates (like Node.js or Python versions). Others leave that to you.

Check the patch policy before you deploy. If you're expecting automatic security patching but the provider only patches monthly during maintenance windows, you have a gap.

Monitoring and alerting

Managed services include operational telemetry for the provider's layer. CPU utilization, memory pressure, disk I/O, network throughput, service health checks.

The provider monitors this data and responds to threshold violations. If a hypervisor starts dropping packets, they fix it. If a managed database replica falls out of sync, they investigate.

But they don't monitor your application metrics. They don't know if your API response times are spiking or if your background job queue is backing up. You still need application performance monitoring and log aggregation for your layer.

Backup and replication

Most managed infrastructure includes data durability features. Storage replication across availability zones, automated database backups, snapshot scheduling.

Replication prevents data loss from hardware failure. Backups let you recover from logical errors like accidental table drops or corrupted data.

But backup features don't automatically give you a working disaster recovery plan. You still need to validate restore procedures, document recovery steps, and test failover under load.

Incident response for their layer

When something breaks in the provider's layer, they respond. If the control plane API goes down, they triage and fix it. If storage performance degrades, they investigate and remediate.

Support boundaries vary by service tier. Basic support might mean "open a ticket and wait for email response." Premium support might include phone escalation and faster SLA response times.

But support only covers the provider's scope. If your application crashes because of a memory leak, provider support won't debug it. If your database query runs slowly because of a missing index, they'll tell you to optimize the query.

What you still own

The operational tasks you keep are just as important as the ones you offload. Miss something here and "managed infrastructure" won't save you.

Identity and access management

You own every access policy in your environment. User identities, service accounts, API keys, role-based access control, permission boundaries, multi-factor authentication settings.

If someone misconfigures IAM and grants overly permissive access, the blast radius is on you. Providers give you IAM tools. They don't audit your policies or tell you when permissions are too broad.

Most breaches start with compromised credentials or overly permissive access policies. Managed infrastructure doesn't fix that.
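One cheap guardrail you can run yourself is a scan for wildcard grants. A minimal sketch, assuming your policies are available as JSON documents in the common Effect/Action/Resource shape (the field names follow AWS-style IAM policies; adapt to your provider's format):

```python
def find_overly_broad_statements(policy: dict) -> list[dict]:
    """Flag Allow statements that grant all actions or all resources."""
    flagged = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Policies permit either a single string or a list; normalize to lists.
        if isinstance(actions, str):
            actions = [actions]
        if isinstance(resources, str):
            resources = [resources]
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged
```

Run it in CI against every policy change and you catch the "temporary" admin grant before it ships.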

Data protection and encryption

Providers offer encryption features. Encryption at rest for storage volumes, encryption in transit for network traffic, key management services for cryptographic operations.

But you choose whether to use those features. You configure encryption settings. You manage key rotation schedules. You decide what data needs encryption and what compliance frameworks apply.

You also own data classification, retention policies, and deletion procedures. If regulations require you to delete customer data within 30 days of a request, the provider isn't tracking that for you.

Network configuration

Providers run the physical network and the virtualization layer. You configure everything on top of it.

Virtual private clouds, subnet definitions, routing tables, security groups, firewall rules, load balancer configuration, DNS records, private connectivity, VPN tunnels.

A misconfigured security group can expose databases to the public internet. A bad routing rule can blackhole production traffic. Providers give you networking primitives. You assemble them into a working architecture.
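That database-exposure failure mode is mechanical enough to check automatically. A sketch, where the rule dictionaries are a simplified, hypothetical model of security-group ingress rules rather than any specific provider's API:

```python
DB_PORTS = {3306, 5432, 1433, 27017}  # MySQL, PostgreSQL, SQL Server, MongoDB

def publicly_exposed_db_rules(rules: list[dict]) -> list[dict]:
    """Return ingress rules that open a database port to the whole internet."""
    exposed = []
    for rule in rules:
        if rule.get("cidr") != "0.0.0.0/0":
            continue
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if DB_PORTS.intersection(ports):
            exposed.append(rule)
    return exposed
```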

Application deployment and operation

Everything about your workloads is your problem. Container images, deployment manifests, scaling policies, health checks, service discovery, configuration management, environment variables, secrets management.

If your application has a memory leak, that's yours to fix. If a deployment config references a missing ConfigMap, that's yours to debug. If a service mesh configuration breaks inter-service communication, that's yours to untangle.

Managed infrastructure keeps the platform running. It doesn't make your code work.

Restore testing and recovery validation

Some managed services include backup automation. Very few include restore testing.

And backups are worthless if you can't restore them under pressure.

You own the restore runbook. You own the validation checklist. You own the drill schedule. You own the decision about whether a backup is viable or corrupted.

If your RTO is 15 minutes but you've never timed a restore, you don't actually know if you can hit that target.
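A drill only answers that question if you time it. A minimal harness, where `restore_fn` and `validate_fn` are placeholders for your own restore procedure and post-restore consistency check:

```python
import time

def timed_restore_drill(restore_fn, validate_fn, rto_seconds: float) -> dict:
    """Run a restore, validate the result, and compare elapsed time to the RTO."""
    start = time.monotonic()
    restore_fn()                 # e.g. restore snapshot into a staging environment
    ok = validate_fn()           # e.g. row counts, checksums, application smoke test
    elapsed = time.monotonic() - start
    return {"restored_ok": ok, "elapsed_s": elapsed,
            "within_rto": ok and elapsed <= rto_seconds}
```

Log the result of every drill. The trend line matters as much as any single run: a restore that used to take 8 minutes and now takes 14 is drifting toward an RTO breach.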

The responsibility matrix

This is the table you put in your statement of work and reference during incident calls.

| Layer | Provider typically owns | You typically own |
| --- | --- | --- |
| Physical infrastructure | Data center operations, power, cooling, physical security | Nothing |
| Compute hardware | Server lifecycle, hardware replacement, hypervisor patching | Nothing |
| Network fabric | Routing infrastructure, backbone connectivity, DDoS mitigation | Virtual network design, subnets, security groups, firewall rules |
| Storage systems | Replication, durability, snapshot mechanics | Data classification, retention policies, encryption configuration |
| Managed Kubernetes control plane | API server uptime, etcd backups, control plane upgrades | RBAC policies, admission controllers, namespaces, workload configuration |
| Worker nodes | Varies by tier (sometimes provider, sometimes customer) | Node configuration, runtime add-ons, scaling policies, OS hardening if customer-managed |
| Managed database engine | Replication, failover, engine patches, version upgrades | Schema design, query optimization, connection pooling, access control |
| Backup and DR orchestration | Backup scheduling, replication mechanics, snapshot storage | Restore validation, recovery runbooks, failover testing, application-level consistency |
| Monitoring and logging | Infrastructure metrics, platform health checks, alerting for provider layer | Application performance monitoring, log aggregation, custom metrics, alert tuning |
| Security | Physical security, hypervisor hardening, network isolation | IAM policies, encryption configuration, vulnerability patching for customer-managed components |

The exact split depends on the service contract. Some providers manage worker nodes. Others don't. Some providers automatically patch guest operating systems. Others require you to opt in.

Get the provider to fill out this table before you sign. Then reference it when something breaks.
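Once the matrix exists, it can drive incident routing directly. A sketch, with an illustrative (not exhaustive) mapping that you would replace with the one from your own contract:

```python
# Illustrative mapping -- fill this in from your signed responsibility matrix.
RESPONSIBILITY = {
    "control_plane": "provider",
    "worker_nodes": "customer",   # varies by tier; confirm in your contract
    "rbac_policies": "customer",
    "database_engine": "provider",
    "schema_design": "customer",
}

def route_incident(component: str) -> str:
    """Turn the responsibility matrix into a first routing decision."""
    owner = RESPONSIBILITY.get(component)
    if owner is None:
        return "unmapped component -- clarify ownership before the next incident"
    return "open provider support ticket" if owner == "provider" else "page internal on-call"
```

The "unmapped" branch is the useful one: every component it fires on is a gap in the contract.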

Service level agreements and uptime claims

An SLA is a contract term that defines availability targets and remedies for failing to meet them.

Most managed infrastructure SLAs specify an uptime percentage for a specific component. A 99.99% SLA for the Kubernetes control plane means the API server will be reachable 99.99% of the time, measured over a calendar month.

SLAs usually exclude downtime caused by maintenance windows, customer misconfigurations, third-party failures, or force majeure events. Read the exclusions carefully. If scheduled maintenance doesn't count against uptime, find out how much scheduled maintenance the provider actually does.

SLA remedies are typically service credits. If the provider misses their uptime target, you get a percentage of your monthly bill back. The credit amount scales with how far below the target they fell.

Credits sound nice. They don't fix an outage or compensate for lost revenue.

The real value of an SLA is accountability. It creates a measurable commitment. It gives you negotiating leverage during renewals. It forces the provider to document what they're responsible for.

But SLAs don't cover your workloads. If the control plane is up but your application is down because of a bad configuration change, the provider isn't in breach.

Your RTO and RPO requirements might be tighter than the provider's SLA. If you need 99.99% application availability but the provider only commits to 99.95% infrastructure availability, you need redundancy and failover automation to close the gap.
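Serially dependent layers multiply: end-to-end availability can't exceed the product of the layers beneath it. A quick way to see the gap:

```python
def stacked_availability(*layer_percents: float) -> float:
    """Combined availability of serially dependent layers: every layer must be up."""
    a = 1.0
    for p in layer_percents:
        a *= p / 100
    return a * 100

# A 99.99% app on 99.95% infrastructure yields ~99.94% end to end --
# below both individual targets, which is why you need redundancy to close the gap.
```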

Why the boundary matters during incidents

When something breaks, the first question is always "whose problem is this?"

If the boundary is clear, you route the ticket correctly and start fixing things. If the boundary is vague, you spend 30 minutes on a call arguing about scope while downtime accumulates.

Managed infrastructure doesn't eliminate operational burden. It shifts it. Tasks you offload to the provider become tasks you coordinate with the provider during incidents.

If the control plane is slow, you open a support ticket and wait for the provider to investigate. If worker nodes are crashlooping because of a kernel panic, you need to know whether node OS management is your responsibility or theirs before you start troubleshooting.

Fast incident response depends on knowing exactly where the handoff happens. That means documentation. Runbooks that say "if X fails, open a ticket with provider support." Escalation procedures that specify what you diagnose internally vs what you escalate externally. Monitoring that separates infrastructure health from application health so you know which layer failed.
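The routing logic that monitoring should feed is simple to encode. A sketch, assuming you already run one health check against the provider's layer and one against your application:

```python
def classify_failure(infra_healthy: bool, app_healthy: bool) -> str:
    """Decide which side of the boundary failed, given two independent checks."""
    if not infra_healthy:
        # If the infrastructure layer is down, app failure is expected downstream:
        # escalate to the provider first rather than debugging your own stack.
        return "escalate to provider"
    if not app_healthy:
        return "page internal on-call"
    return "healthy"
```

The ordering encodes the runbook: infrastructure failure explains application failure, never the reverse.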

The boundary also affects post-incident learning. If an outage involved both your misconfigurations and provider service degradation, you need separate action items for each. One action item goes in your backlog. The other action item goes in a support ticket asking the provider to file their own postmortem.

Common managed service models

Different service models draw the boundary in different places.

Managed virtual machines

The provider runs the hypervisor and physical infrastructure. You run everything inside the VM.

You choose the OS image. You apply patches. You install packages. You configure firewall rules. You deploy applications. You monitor resource utilization.

Managed VMs give you maximum control. They also give you maximum operational work. If you're running dozens of VMs, you need configuration management tools and patch automation or you'll spend all your time doing manual updates.

Managed Kubernetes

The provider runs the Kubernetes control plane. API server, scheduler, controller manager, etcd. They handle control plane upgrades, backups, and availability.

Worker node management varies. Some providers manage the node OS and runtime. Others make you manage node pools yourself. Always check which model you're buying.

You still own everything Kubernetes schedules. Deployments, pods, services, ingress rules, config maps, secrets, RBAC policies, network policies, admission webhooks.

Managed Kubernetes offloads control plane toil. It doesn't reduce the complexity of running containerized applications.

Managed databases

The provider runs the database engine, handles replication, performs automated backups, and manages version upgrades.

You design the schema, write queries, configure connection pooling, set up read replicas, tune performance, and manage access credentials.

Managed databases are popular because database administration is specialized work. You get high availability and automated backups without hiring a DBA.

But managed databases don't fix slow queries. They don't optimize indexes. They don't prevent schema design problems.

Managed storage

The provider runs the storage service. They handle replication, durability, and data integrity.

You configure buckets, set access policies, manage lifecycle rules, enable encryption, and control versioning.

Object storage is almost always managed. Block storage and file storage are sometimes managed, sometimes customer-operated depending on the service.

Storage is the easiest thing to offload because storage systems are complex and storage failures are catastrophic. Let the provider worry about bit rot and disk failures.

Evaluating a managed cloud provider

Not all managed services are equal. Some providers manage more than others. Some have better tooling, clearer documentation, and faster support response times.

Start by defining what you want to offload. If you're tired of patching servers, you need managed compute. If you're tired of database operations, you need managed databases. If you're tired of everything, you might need a fully managed private cloud.

Then evaluate providers on these dimensions.

Scope of management. What exactly does the provider operate? Get specifics. "Managed Kubernetes" can mean just the control plane or it can include node OS patching, runtime upgrades, and add-on management. The price is probably different too.

SLA terms. What uptime target do they commit to? What components does it cover? What are the exclusions? What remedies do you get if they miss it?

Support options. What channels can you use for support? Email only, or phone escalation? What are the response time SLAs for different severity levels? Do they charge extra for faster support?

Patch policy. What do they patch automatically? What requires you to opt in? How much notice do they give before applying disruptive updates?

Monitoring and visibility. What metrics do they expose? Can you export logs and metrics to your own observability stack? Do they provide dashboards and alerts for their layer?

Backup and DR features. What backup mechanisms do they provide? How do restores work? Do they test restores themselves? What RTO and RPO can they achieve?

Compliance and certifications. Do they have the compliance certifications your industry requires? SOC 2, ISO 27001, HIPAA, PCI DSS?

Documentation quality. Is the documentation clear and complete? Are there runbooks for common operational tasks? Is the responsibility matrix documented?

Vendor stability. How long has the provider been operating? What's their financial situation? Are they likely to be around in five years?

Cost structure and hidden expenses

Managed services reduce operational toil. They don't necessarily reduce total cost.

You're trading internal labor costs for external service costs. Whether that trade is favorable depends on how much operational work you're offloading and how much the provider charges.

Most managed services use consumption-based pricing. You pay for compute hours, storage gigabytes, network egress, API requests, or some combination.

Watch for hidden costs.

Support tiers. Basic support might be included, but faster response times often cost extra.

Data egress. Moving data out of the provider's network usually incurs charges. If you're pulling large datasets for analytics or backups, egress costs add up.

Backup storage. Automated backups are often included, but storing those backups for long retention periods might be billed separately.

Premium features. High availability, multi-region replication, and advanced monitoring might be add-ons rather than standard features.

Scaling overhead. Some managed services have minimum instance sizes or reserved capacity requirements that drive up costs at small scale.

Calculate total cost of ownership over a realistic time horizon. Include service costs, data transfer costs, support costs, and the internal labor costs for tasks you still own.
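A back-of-the-envelope version of that calculation, with all inputs as monthly figures (the parameter names are illustrative, not a standard cost model):

```python
def total_cost_of_ownership(service: float, egress: float, support: float,
                            internal_hours: float, hourly_rate: float,
                            months: int = 36) -> float:
    """Monthly service + egress + support costs plus internal labor, over a horizon."""
    monthly = service + egress + support + internal_hours * hourly_rate
    return monthly * months
```

Run it twice: once for the managed offering, once for the self-hosted alternative with its higher `internal_hours`. The comparison is only honest if the labor line is realistic on both sides.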

Security and compliance ownership

Managed infrastructure doesn't absolve you of security responsibilities. The provider secures their layer. You secure yours.

The provider hardens the hypervisor, patches the control plane, secures the physical data center, and monitors for infrastructure-level attacks.

You harden workloads, configure access controls, encrypt sensitive data, manage credentials, audit logs, and respond to application-layer security events.

Most compliance frameworks hold you accountable for security outcomes even if you're using managed services. If customer data leaks because of misconfigured access policies, you're liable regardless of who runs the infrastructure.

Security tasks you still own:

  • IAM policies and permission boundaries
  • Encryption key management
  • Network segmentation and firewall rules
  • Vulnerability scanning and patch management for customer-managed components
  • Log retention and audit trails
  • Incident response procedures
  • Compliance documentation and audit preparation

Providers typically offer tools for these tasks. You decide whether to use them and how to configure them.

Disaster recovery and backup responsibilities

Managed services usually include backup primitives. Automated snapshots, cross-region replication, point-in-time recovery.

But primitives aren't a plan.

You own the recovery runbook. You own the restore validation process. You own the failover decision tree.

Backups only help if you can restore them. And restoring under pressure is different from restoring during a test.

Build a DR plan that specifies:

  • What gets backed up and how often
  • Where backups are stored and how long they're retained
  • Who can initiate a restore and under what circumstances
  • Step-by-step restore procedures for each component
  • Application-level consistency checks after restore
  • Failover procedures for multi-region deployments
  • Communication plans for notifying stakeholders during recovery

Then test the plan. Schedule quarterly DR drills. Restore from backup in a staging environment. Measure how long it takes. Document what breaks. Fix the gaps.

RTO and RPO targets are worthless if you've never validated them under realistic conditions.

Questions to ask before signing

Get answers to these questions in writing before you commit to a managed service contract.

  1. What is the exact scope of management? Which layers does the provider operate and which layers do I operate?
  2. What is the SLA? Which components does it cover? What are the exclusions? What remedies do I get if the SLA is missed?
  3. Who patches what? What's the patch schedule? How much notice do I get before disruptive updates?
  4. What is the support boundary? What issues will support help with? What issues are out of scope? What voids support?
  5. How do backups work? What's backed up automatically? How do I initiate a restore? What's the expected RTO and RPO?
  6. How do upgrades work? Are they automatic or manual? What's the maintenance window schedule? Can I defer upgrades?
  7. What monitoring and visibility do I get? What metrics are exposed? Can I export logs and metrics?
  8. What security controls are in place? How does the provider handle security patching? What compliance certifications do they have?
  9. What happens during an incident? How do I escalate? What information do I need to provide? What response times can I expect?
  10. How do I test disaster recovery? Can I run restore drills in a non-production environment?

Put the answers in your SOW. Reference them during incidents.

Making managed services work

Managed cloud infrastructure works when expectations align with reality.

The provider operates specific layers. You operate everything else. Success depends on understanding that boundary and building operational procedures around it.

Start by documenting the responsibility matrix. Map every layer of your stack to either "provider owns" or "you own." Get the provider to agree to that mapping in writing.

Then build runbooks that respect the boundary. If a component is provider-managed, the runbook should say "open support ticket" instead of "SSH into the host and debug."

Test disaster recovery procedures regularly. Backups are useless if you can't restore them under pressure. Schedule quarterly drills and measure actual recovery times.

Monitor both sides of the boundary. Track infrastructure metrics for provider-managed components so you know when to escalate. Track application metrics for your components so you know when the problem is yours.

Managed infrastructure reduces operational toil. It doesn't eliminate engineering work. You still design systems, deploy code, configure services, and respond to incidents. You just coordinate with a provider for the layers they run.

That coordination is worth it if you'd rather spend time on application logic than infrastructure maintenance. But it only works if both sides know exactly what they're responsible for.

For teams running critical workloads, uptime monitoring becomes non-negotiable regardless of whether infrastructure is managed or self-hosted. That's where tools like Odown come in. Odown provides real-time uptime monitoring for websites and APIs, public status pages for transparent incident communication, and SSL certificate monitoring to prevent expiration-related outages. When you're operating at the application layer on top of managed infrastructure, having reliable monitoring that tracks your services independently of the provider's metrics gives you the visibility you need to meet your own SLA commitments.