DevOps at trivago

Streamlining GraphQL Service Testing with Karate

Over the last year trivago refactored the existing GraphQL monolith and moved to a microservice architecture, in what is also known as a federated setup. Federated GraphQL, as championed by Apollo, represents a significant shift in how companies architect and scale their GraphQL ecosystems, especially when transitioning from monolithic designs to microservice-based architectures. In a federated setup, each microservice maintains its own GraphQL schema, relevant to the domain it serves. These individual schemas are then seamlessly stitched together into a unified gateway. This gateway serves as the single access point for clients, enabling them to query data across multiple services as if it were coming from a single, monolithic GraphQL API. Such change involves also a completely new approach to the test and delivery, with the goal of empowering developers to quickly and safely deploy to production autonomously.

Armin Aminian Giuseppe Donati · 8 Jul 2024 · 13 min read

Implementing Data Validation with Great Expectations in Hybrid Environments

Data validation is an essential step in any data processing pipeline, as it ensures the integrity and accuracy of the data to be used across all subsequent processing steps. Great Expectations (GX) is an open-source framework that provides a flexible and efficient way to perform data validation, allowing data scientists and analysts to quickly identify and correct any issues with their data. In this article, we share our experience implementing Great Expectations for data validation in our Hadoop environment, and our take on its benefits and limitations.

Kamila Widyanto · 25 Apr 2023 · 6 min read

How we scaled our Prometheus setup

In 2020 we started to migrate one of our most significant workloads, our Node.js based GraphQL API and many of its microservices, from our datacenter to Google Kubernetes Engine. We deploy it in three GCP regions, each having its Kubernetes cluster. Since then, our monitoring infrastructure has changed due to various periods of instability and pandemic induced scaling challenges.

Simon Brüggen · 23 Aug 2022 · 9 min read

SRE: On-Call Procedure at trivago

One of the many responsibilities of a Site Reliability Engineer (SRE), is to ensure uptime, availability and in some cases, consistency of the product. In this context, the product refers to the website, APIs, microservices, and servers. This responsibility of keeping the product up and running becomes particularly interesting if the product is used around the world 24 hours every day like trivago. And just like in the medical profession, someone has to be on call to react on failures and outages outside of the office hours.

Kenechukwu Nnamani · 18 Jul 2022 · 7 min read

Being on-call as a software engineer - a challenging and fast learning experience

At trivago, we run webservices with complex backends in different regions around the globe 24/7. Our system is being iterated and developed on a daily basis. Naturally, mistakes will be made and something will break eventually. Engineers being on-call are the first responders to issues with negative impact on our users and the business.

Stefan Nothaas · 12 Jan 2022 · 8 min read

Proper (Java) application life cycle management in Kubernetes

When operating applications in Kubernetes, proper lifecycle management is crucial to enable Kubernetes to manage applications correctly throughout their different phases: startup, runtime and shutdown. Improper or incomplete lifecycle management can lead to incidents with unforeseen and difficult to debug application behavior, such as random CrashLoopBackOffs, broken/zombie services not being restarted or even entire services not becoming healthy after a scheduled restart.

Stefan Nothaas Lars Heß · 9 Jun 2021 · 8 min read

Google Cloud Workload-Placement-Guide

At trivago we operate a hybrid infrastructure of both on-premise machines and clusters on Google Cloud. Over time, we came up with a set of deployment guidelines for running our workloads as more and more of them are migrating to Google Cloud. These are not strict rules, but rather suggestions to best serve each team's needs.

Arne Claus · 17 Jul 2020 · 10 min read

Cross-Cluster Traffic Mirroring with Istio

The price of reliability is the pursuit of the utmost simplicity.
— C.A.R. Hoare, Turing Award lecture

Have you ever enthusiastically released a new, delightful version to production and then suddenly started hearing a concerning number of notification sounds? Gets your heart beating right? After all, you didn't really expect this to happen because it worked in the development environment.

Mert Acikportali · 10 Jun 2020 · 10 min read

ElasticWars Episode IV: A new field

On a normal day, we ingest a lot of data into our ELK clusters (~6TB across all of our data centers). This is mostly operational data (logs) from different components in our infrastructure. This data ranges from purely technical info (logs from our services) to data about which pages our users are loading (intersection between business and technical data).

Jorge Luis Betancourt · 3 Jun 2020 · 8 min read

trivago joins the Cloud Native Computing Foundation

Last year, when visiting CloudNativeCon/KubeCon Europe in Barcelona (one of the biggest cloud-focused conferences in Europe), I noticed that there were some companies present in the exhibition space whose primary focus wasn't software development. I was surprised to see companies from finance to sportswear as Cloud Native Computing Foundation (CNCF) sponsors. There I discovered various CNCF membership types and learned about the End User Supporter membership.

Daniel Kleuser · 7 Apr 2020 · 3 min read

Why We Chose Go

To the outside, trivago appears to be one single software product providing our popular hotel meta search. Behind the scenes, however, it is home to dozens of projects and tools to support it. Teams are encouraged to choose the programming languages and frameworks that will get the job done best. Only few restrictions are placed on the teams in these decisions, primarily long-term maintainability. As a result, trivago has a largely polyglot code base that fosters creativity and diverse thinking. It allows us to make informed decisions based on actual requirements rather than legacy code or antiquated projects.

Martin Mai · 2 Mar 2020 · 7 min read

Makefiles in 2019 — Why They Still Matter

Make was created in 1976 by Stuart Feldman at Bell Labs to help build C programs. But how can this 40+ year old piece of software help us develop and maintain our ever-growing amount of cloud-based microservices?

Simon Brüggen · 20 Dec 2019 · 10 min read

Nomad - our experiences and best practices

Hello from trivago's performance & monitoring team. One important part of our job is to ship more than a terabyte of logs and system metrics per day, from various data sources into elasticsearch, several time series databases and other data sinks. We do so by reading most of the data from multiple Kafka clusters and processing them with nearly 100 Logstashes. Our clusters currently consists of ~30 machines running Debian 7 with bare-metal installations of the aforementioned services. This summer we decided to migrate all of this to an on-premise [Nomad](https://www.nomadproject.io/ cluster) cluster.

Inga Feick · 25 Jan 2019 · 10 min read

Building fast and reliable web applications

Test, test, test. If you don’t, an issue is bound to crop up in production sooner or later.

We’ve all heard this mantra in one form or another. The importance of testing your software has been covered by countless articles, books and conferences. You worked hard on your code coverage and your downtime due to regression-related bugs has severely decreased.

Frank van Gemeren · 12 Oct 2018 · 10 min read

Automate and Encourage! The New Tech Blog Deployment Process

We do think that our tech blog is full of interesting things powered by our engineers' great stories. Let me take you on a journey of how we maintain the trivago tech blog from the technical perspective and how we recently automated its deployment process.

Busra Koken · 7 Dec 2017 · 6 min read

DevOps at trivago

Streamlining GraphQL Service Testing with Karate

Implementing Data Validation with Great Expectations in Hybrid Environments

How we scaled our Prometheus setup

SRE: On-Call Procedure at trivago

Being on-call as a software engineer - a challenging and fast learning experience

Proper (Java) application life cycle management in Kubernetes

Google Cloud Workload-Placement-Guide

Cross-Cluster Traffic Mirroring with Istio

ElasticWars Episode IV: A new field

trivago joins the Cloud Native Computing Foundation

Why We Chose Go

Makefiles in 2019 — Why They Still Matter

Nomad - our experiences and best practices

Building fast and reliable web applications

Automate and Encourage! The New Tech Blog Deployment Process

Popular tags

Featured articles

Implementing Data Validation with Great Expectations in Hybrid Environments

How we scaled our Prometheus setup

Being on-call as a software engineer - a challenging and fast learning experience

Java Reactive Programming - Effective Usage in a Real World Application

Learn Redis the hard way (in production)

trivago tech newsletter

Popular tags

Featured articles

Career? trivago.