If you are a site reliability engineer (SRE) for a large Kubernetes-powered application, optimizing resources and performance is a daunting job. Some spikes, like a busy shopping day, can be broadly planned for, but doing that well requires a painstaking understanding of the behavior of hundreds of microservices and their interdependencies, and that understanding has to be re-evaluated with each new release. This is not a scalable approach, to say nothing of the monotony and resulting stress for the SRE. Moreover, there will always be unexpected peaks to respond to. Continually keeping tabs on performance and putting the optimal amount of resources in the right place is essentially impossible to do manually.
Today the problem is handled through gross overprovisioning, or through a combination of guesswork and endless alerts that require support teams to review and intervene. That is neither sustainable nor practical, and it certainly does not scale. But it is just the kind of problem that machine learning and AI thrive on. We have spent the last decade dealing with such problems, and the arrival of the latest generation of AI tools, such as generative AI, has opened the possibility of applying machine learning to the real problems of the SRE and realizing the promise of AIOps.