In 2020 we started to migrate one of our most significant workloads, our Node.js based GraphQL API and many of its microservices, from our datacenter to Google Kubernetes Engine. We deploy it in three GCP regions, each having its Kubernetes cluster. Since then, our monitoring infrastructure has changed due to various periods of instability and pandemic induced scaling challenges.
When was the last time you booked accommodation without checking its photos? Most probably never! Because having imagery information makes our decision-making process much easier and faster. However, picking up the best possible images of a hotel to show to the user is an interesting problem to solve, because it can be a naive random selection or a sophisticated machine learning model to know what the user truly wants at that moment.
When operating applications in Kubernetes, proper lifecycle management is crucial to enable Kubernetes to manage applications correctly throughout their different phases: startup, runtime and shutdown. Improper or incomplete lifecycle management can lead to incidents with unforeseen and difficult to debug application behavior, such as random CrashLoopBackOffs, broken/zombie services not being restarted or even entire services not becoming healthy after a scheduled restart.
At trivago we operate a hybrid infrastructure of both on-premise machines and clusters on Google Cloud. Over time, we came up with a set of deployment guidelines for running our workloads as more and more of them are migrating to Google Cloud. These are not strict rules, but rather suggestions to best serve each team's needs.
Have you ever enthusiastically released a new, delightful version to production and then suddenly started hearing a concerning number of notification sounds? Gets your heart beating right? After all, you didn't really expect this to happen because it worked in the development environment.
Last year, when visiting CloudNativeCon/KubeCon Europe in Barcelona (one of the biggest cloud-focused conferences in Europe), I noticed that there were some companies present in the exhibition space whose primary focus wasn't software development. I was surprised to see companies from finance to sportswear as Cloud Native Computing Foundation (CNCF) sponsors. There I discovered various CNCF membership types and learned about the End User Supporter membership.
At trivago, we have several workflows which interact with external services. The health and availability of external services can have an impact on keeping our workflows alive and responsive. Think of an API call made to an external service which is down. Our workflows have to be prepared to expect these errors and adapt to it.