In 2020 we started to migrate one of our most significant workloads, our Node.js based GraphQL API and many of its microservices, from our datacenter to Google Kubernetes Engine. We deploy it in three GCP regions, each having its Kubernetes cluster. Since then, our monitoring infrastructure has changed due to various periods of instability and pandemic induced scaling challenges.
Monitoring at trivago
Insights, experiences and learnings from trivago's tech teams.
At trivago, we generate a huge amount of logs and we have our own custom setup for shipping logs using mostly Protocol Buffers. Eventually we end up with some fields in Elasticsearch (ES) that contain partial (or full) URLs. For instance, in our specific case we store the query component of the URL in a field called
query and the path component in a field named
url_path. Sample values for these fields could be:
tl;dr: continuously monitor your CDN and origin servers on layer 3 with tools like MTR. Layer 3 issues on external middleware can have a significant impact on layer 7 web performance.
Hello from trivago's performance & monitoring team. One important part of our job is to ship more than a terabyte of logs and system metrics per day, from various data sources into elasticsearch, several time series databases and other data sinks. We do so by reading most of the data from multiple Kafka clusters and processing them with nearly 100 Logstashes. Our clusters currently consists of ~30 machines running Debian 7 with bare-metal installations of the aforementioned services. This summer we decided to migrate all of this to an on-premise [Nomad](https://www.nomadproject.io/ cluster) cluster.