Read How to substantially slow down your Node.js server

How to substantially slow down your Node.js server

Back in March 2022, after spending a considerable amount of effort migrating our monolithic Node.js GraphQL server from Express to Fastify, we noticed absolutely no performance improvements in production. That hit us like a bombshell, especially because Fastify performed exceptionally well in our k6 load tests in staging, where it responded to HTTP requests 107% (more than two times) faster on average than Express!

Read the whole article ›
Read Powering ML-Based Systems With Reliable Data: The Data Annotation Journey

Powering ML-Based Systems With Reliable Data: The Data Annotation Journey

In the last few years, organisations have been increasing their investments in building Machine Learning (ML) based systems. In practice, such systems often took longer than expected to be built or failed to deliver the promised outcome. Data availability and quality have been among the most significant reasons behind this phenomenon. Since each organisation had its custom problems, open datasets or even logged data were not always directly usable. As a result, data collection and annotation processes became more crucial, yet remained under-documented.

Read the whole article ›
Read How we scaled our Prometheus setup

How we scaled our Prometheus setup

In 2020 we started to migrate one of our most significant workloads, our Node.js based GraphQL API and many of its microservices, from our datacenter to Google Kubernetes Engine. We deploy it in three GCP regions, each having its Kubernetes cluster. Since then, our monitoring infrastructure has changed due to various periods of instability and pandemic induced scaling challenges.

Read the whole article ›
Read How to Survive a Regional Outage

How to Survive a Regional Outage

As I’m writing this, we’re in the middle of our yearly load testing process.
Since a couple of years now, trivago conducts regular production load tests. We do this to test if all our services sustain the increased load we experience during the summer and winter months.
This year is now the 2nd time, where we also do another test: A "regional failover" test.

Read the whole article ›
Read SRE: On-Call Procedure at trivago

SRE: On-Call Procedure at trivago

One of the many responsibilities of a Site Reliability Engineer (SRE), is to ensure uptime, availability and in some cases, consistency of the product. In this context, the product refers to the website, APIs, microservices, and servers. This responsibility of keeping the product up and running becomes particularly interesting if the product is used around the world 24 hours every day like trivago. And just like in the medical profession, someone has to be on call to react on failures and outages outside of the office hours.

Read the whole article ›
Read How we got on top of our data

How we got on top of our data

Scalability and availability are key aspects of cloud native computing. If your microservice takes five minutes to start up, it becomes very difficult to meet the expectations because adjustments to traffic changes, regional failovers, hot-fixes and rollbacks are simply too slow. In this article, we show how we solved this and a few other problems by taking control of the process of updating our data and storing it in a highly available Redis setup.

Read the whole article ›
Read Improving Evaluation Practices in Natural Language Generation

Improving Evaluation Practices in Natural Language Generation

Throughout last year I had the opportunity to participate and collaborate on multiple research initiatives in the field of Natural Language Generation (NLG) in addition to my responsibilities as a Data Scientist at trivago. NLG is the process of automatically generating text from either text and/or non-linguistic data inputs. Some NLG applications include chatbots, image captioning, and report generation. These are application areas of high interest internally within trivago as we seek to leverage our rich data environment to enrich the user experience with potential NLG applications.

Read the whole article ›

We're Hiring

Tackling hard problems is like going on an adventure. Solving a technical challenge feels like finding a hidden treasure. Want to go treasure hunting with us?