Imagine that thousands of accommodations match a user's search and you have to select the “best” 25 to show. Which ones would you pick: the ones you know perform well, or ones that have never been shown before, so that you can discover new high-potential accommodations? In data science, this is known as the exploitation (continue doing what works well) versus exploration (try something new to discover hidden potential) problem, and it is often explained using the well-known multi-armed bandit problem. The objective there is to divide a fixed amount of resources among competing choices so as to maximize the expected gain, given that the properties of each choice are not fully known at the time of allocation.
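To make the trade-off concrete, here is a minimal sketch of epsilon-greedy selection, one of the simplest bandit strategies. It is an illustration only, not necessarily the approach used in production: the function name `select_accommodations`, the `(clicks, impressions)` statistics, and the use of click rate as the "gain" are all assumptions made for this example.

```python
import random


def select_accommodations(stats, k, epsilon, rng):
    """Pick up to k accommodation ids with an epsilon-greedy rule.

    stats maps an id to a (clicks, impressions) tuple (hypothetical data).
    With probability epsilon a slot goes to a random remaining item
    (exploration); otherwise it goes to the remaining item with the
    highest observed click rate (exploitation).
    """
    def observed_rate(acc_id):
        clicks, impressions = stats[acc_id]
        # Items that were never shown have no evidence yet; treat as 0.
        return clicks / impressions if impressions else 0.0

    # Best observed performers first.
    remaining = sorted(stats, key=observed_rate, reverse=True)
    chosen = []
    while remaining and len(chosen) < k:
        if rng.random() < epsilon:
            pick = rng.choice(remaining)  # explore
        else:
            pick = remaining[0]  # exploit
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


# Hypothetical statistics: "c" has never been shown to a user.
example_stats = {"a": (50, 100), "b": (5, 100), "c": (0, 0), "d": (30, 100)}
print(select_accommodations(example_stats, 2, 0.1, random.Random(42)))
```

With `epsilon` set to 0 the function always returns the known top performers, so an item like `"c"` never gets a chance; raising `epsilon` spends more slots on discovery at the cost of short-term performance.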
In 2020 we started migrating one of our most significant workloads, our Node.js-based GraphQL API and many of its microservices, from our datacenter to Google Kubernetes Engine. We deploy it in three GCP regions, each with its own Kubernetes cluster. Since then, our monitoring infrastructure has changed due to various periods of instability and pandemic-induced scaling challenges.