Metrics are one of the main building blocks of observability, and we use them heavily. This story is about an incident where we tried to find and resolve a problem that we saw in these metrics. We went down a rabbit hole of potential fixes, only to discover that the metrics were correct all along.
Posts about Monitoring
At trivago, we generate a huge amount of logs and we have our own custom setup for shipping logs using mostly Protocol Buffers. Eventually we end up with some fields in Elasticsearch (ES) that contain partial (or full) URLs. For instance, in our specific case we store the query component of the URL in a field called query and the path component in a field named url_path. Sample values for these fields could be:
tl;dr: continuously monitor your CDN and origin servers on layer 3 with tools like MTR. Layer 3 issues on external middleware can have a significant impact on layer 7 web performance. In a recent rollout of a new cloud service, we monitored the impact of this service on web performance, UX and business metrics. For all cloud regions and origin servers, we had Synthetic and Real User Monitoring for our site in place.
Hello from trivago’s performance & monitoring team. One important part of our job is to ship more than a terabyte of logs and system metrics per day from various data sources into Elasticsearch, several time series databases and other data sinks. We do so by reading most of the data from multiple Kafka clusters and processing it with nearly 100 Logstash instances. Our clusters currently consist of ~30 machines running Debian 7 with bare-metal installations of the aforementioned services.
Ever heard about Microservices? Those tiny little pieces of code that are used to split a big pile of magic into smaller pieces of magic? Well, they’re not that tiny after all and require lots of preliminary work to use properly. Have a look at this post to hear about my journey of splitting an existing monolith written in PHP into several microservices written in Go.
We were not as happy as we could be with our Cucumber test reporting solution, so we decided to build a shiny new one from scratch.
We’re a data-driven company. At trivago we love measuring everything. Collecting metrics and making decisions based on them comes naturally to all our engineers. This workflow also applies to performance, which is key to succeeding on the modern Internet.
At trivago we store a subset of our realtime metric data in InfluxDB and we are quite impressed by the load it can handle. Despite all the joy, we had to learn some lessons the hard way. It is pretty easy to overload the database or the web browser by executing queries that return too many datapoints.
At trivago we rely heavily on the ELK stack for our log processing. We stream our webserver access logs, error logs, performance benchmarks and all kinds of diagnostic data into Kafka and process it from there into Elasticsearch using Logstash.