
xgerman's technology blog

YOLO Cloud

Introduction

Let's assume you have SLAs and there are costs associated with them: You try to run things as reliably as possible. Your databases are redundant, you carefully plan maintenances, and if a CVE happens you get paged and stay on until it's fixed. You spend a lot of time and resources on uptime.

Enter the YOLO cloud: Databases go down, CVEs are ignored, and you spontaneously shut down servers for a day or two to exchange a hard drive… and then you complain that load balancers go into ERROR.

What does “Operator Grade” mean?

When we designed Octavia to be operator grade, we targeted it at big public clouds where downtime of an LB (or any other service) meant real money lost. A NOC would monitor 24x7 and follow up on early indicators: e.g. if Octavia had trouble with nova, neutron, the queue, or the DB, it would just be a matter of time until support calls about those services piled up. Hence, Octavia would throw things into ERROR at the slightest hint of something being wrong, so people could follow up on it.

Even when an LB is in ERROR it still works for customers (hence not causing SLA violations) but might not have the same redundancy as a healthy LB.

Contrast this with a service like nova. Once a VM is provisioned, it will just say “ONLINE”. The server might explode afterwards, the control plane DB might go down, etc. – nova won't care. Hence, if you build a service on top of VMs (OpenStack or any other cloud) you need some way of dealing with VMs disappearing – but, and this is important, it's not the problem of the cloud operator. Things disappear - deal with it.

The expectation for a load balancer is different: it should always be on. You want the customer to at least get an error page instead of the browser timing out. So Octavia doesn't have the luxury of nova's set-and-forget. A service built on top of nova, like Octavia, needs to mitigate crashes of the underlying infrastructure without making things worse (Octavia has rate limits for nova, for instance).
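To make the rate-limiting idea concrete, here is a minimal sketch in Python of throttling calls to a compute API. It illustrates the technique only – it is not Octavia's actual implementation, and `RateLimiter`, `rebuild_amphora`, and `compute_client.rebuild_server` are hypothetical names.

```python
# Sketch only (not Octavia code): cap how often we hit an underlying service
# such as nova, so a burst of self-healing doesn't hammer an API that may
# already be struggling.
import threading
import time


class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds (sliding window)."""

    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self._lock = threading.Lock()
        self._timestamps = []  # times of recent calls

    def acquire(self):
        while True:
            with self._lock:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window.
                self._timestamps = [t for t in self._timestamps
                                    if now - t < self.period]
                if len(self._timestamps) < self.max_calls:
                    self._timestamps.append(now)
                    return
                # Wait until the oldest call falls out of the window.
                sleep_for = self.period - (now - self._timestamps[0])
            time.sleep(sleep_for)


# Hypothetical usage: every rebuild request goes through the limiter first.
nova_limiter = RateLimiter(max_calls=10, period=60)


def rebuild_amphora(compute_client, server_id):
    nova_limiter.acquire()
    compute_client.rebuild_server(server_id)  # placeholder compute call
```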

To be always up, Octavia monitors all LBs all the time and tries to self-heal them if necessary. If it can't self-heal because some other service is down, it notifies the operator – which it does by changing the provisioning status of the LB to “ERROR” while keeping the LB running (so no service outage for the user), in the hope that the operator fixes the issue quickly before it spreads to other parts of the cloud.
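That behaviour could be sketched roughly like this (purely illustrative Python – this is not Octavia's health manager; `LoadBalancer`, `is_healthy`, `try_self_heal`, and `DependencyDownError` are made-up names):

```python
# Toy monitoring loop: try to self-heal, and if a dependency is down, flag the
# LB for the operator while leaving the data plane serving traffic.
import logging
import time

LOG = logging.getLogger(__name__)


class DependencyDownError(Exception):
    """Raised when a required service (nova, neutron, DB, ...) is unavailable."""


def monitor_loop(load_balancers, interval=10):
    while True:
        for lb in load_balancers:
            if lb.is_healthy():
                continue
            try:
                # e.g. replace a failed backend VM, re-plug a port, ...
                lb.try_self_heal()
            except DependencyDownError as exc:
                # Only the provisioning status changes; traffic keeps flowing.
                lb.provisioning_status = "ERROR"
                LOG.warning("LB %s needs operator attention: %s", lb.id, exc)
        time.sleep(interval)
```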

Embrace the YOLO cloud

Today's private clouds, and to some extent public clouds, are pretty much YOLO. With Spectre and Rowhammer it might be safer to assume you are hacked than to even bother applying kernel patches. There is often little business value in keeping a cloud up 99.999% of the time, so shutting down random things at random times becomes the norm. Most OpenStack tooling (OSA, TripleO) tends to shut down control plane services like the database and the queue during upgrades. As a case study, Kubernetes will retry forever until the underlying cloud improves or the service stops crash looping and starts working. So outages are just propagated to the user without the system necessarily notifying anybody.

What does this mean practically? Should a service signal happiness until things are royally broken? Is permanently losing one server of an ACTIVE-PASSIVE pair a degradation or an error? Is a service insecure when it assumes timely patching of CVEs – or should we just assume things are hacked and obfuscate code sufficiently to survive memory dumps?

If you run a YOLO cloud you will say YES to all of that, since it doesn't come out of your budget - and if you write software to run on some YOLO cloud you will likely say they should do a better job (since all that chaos engineering comes out of your budget). Most people are caught in the middle: It would be nice if the database wouldn't shut down so often, but ops is not gonna fix that, so we get yelled at and have to mitigate in software. Without a database our service is unusable (unless we want to get into the business of losing data) - isn't this why exponential backoff was invented? The world waits and software gets crappier by the day (see also https://www.theatlantic.com/technology/archive/2017/09/saving-the-world-from-code/540393/)
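For the database case, exponential backoff is simple enough to show in a few lines. The sketch below is generic Python with made-up names (`with_backoff`, `db_session`), not tied to any particular service:

```python
# Retry an operation with growing, slightly jittered delays instead of
# hammering a control-plane service that is already down.
import random
import time


def with_backoff(operation, max_retries=8, base_delay=0.5, max_delay=60):
    """Call `operation()`, retrying on failure with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up and surface the error (e.g. go to ERROR)
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter so many clients don't retry in lockstep.
            time.sleep(delay * random.uniform(0.5, 1.5))


# Hypothetical usage: re-run a DB query while the cluster restarts.
# result = with_backoff(lambda: db_session.execute(query))
```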

Looking at industry trends, more tooling is developed to mitigate YOLO clouds (e.g. end-to-end encryption in Istio to get around insecure networks) than tooling to improve cloud operations and/or uptime. So, at least for now, the YOLO cloud has won and we live in a YOLO cloud world ;-)

Conclusion

In my own life I have been woken up too many times in the middle of the night for CVE mitigations and service outages, which makes the YOLO cloud, though liberating, a novel concept I am still grappling with. Also, why does a galera cluster DB not run at 99.999%? Yolo!