During the development of customer-facing applications, time is crucial, especially when it comes to testing and analyzing changes before accepting them in production. This blog post explores how we developed a Java-based reactive tool to simulate production requests, that allows us to have quicker hints about the effects of changes introduced and be more confident about the hypotheses that are formulated. As a long term vision, we wish to significantly reduce the A/B testing time and ensure seamless transitions.
trivago provides travelers with an extensive collection of hotels, empowering them to compare prices and uncover the best vacation deals. With so many exceptional options available, we have introduced a new feature called "Favorites" to streamline the navigation process. This feature enables users to effortlessly save their preferred accommodations and access them later, ensuring ease of use. To access this feature, visit https://www.trivago.com/en-US/favorites.
Back in March 2022, after spending a considerable amount of effort migrating our monolithic Node.js GraphQL server from Express to Fastify, we noticed absolutely no performance improvements in production. That hit us like a bombshell, especially because Fastify performed exceptionally well in our
k6 load tests in staging, where it responded to HTTP requests 107% (more than two times) faster on average than Express!
As I’m writing this, we’re in the middle of our yearly load testing process.
Since a couple of years now, trivago conducts regular production load tests. We do this to test if all our services sustain the increased load we experience during the summer and winter months.
This year is now the 2nd time, where we also do another test: A "regional failover" test.
Scalability and availability are key aspects of cloud native computing. If your microservice takes five minutes to start up, it becomes very difficult to meet the expectations because adjustments to traffic changes, regional failovers, hot-fixes and rollbacks are simply too slow. In this article, we show how we solved this and a few other problems by taking control of the process of updating our data and storing it in a highly available Redis setup.
At trivago we operate on petabytes of data. In live-traffic applications that are related to the bidding business cases we use our in-house in-memory key-value storage-service written in Java to keep data as close to calculation logic as possible.
When was the last time you booked accommodation without checking its photos? Most probably never! Because having imagery information makes our decision-making process much easier and faster. However, picking up the best possible images of a hotel to show to the user is an interesting problem to solve, because it can be a naive random selection or a sophisticated machine learning model to know what the user truly wants at that moment.
When operating applications in Kubernetes, proper lifecycle management is crucial to enable Kubernetes to manage applications correctly throughout their different phases: startup, runtime and shutdown. Improper or incomplete lifecycle management can lead to incidents with unforeseen and difficult to debug application behavior, such as random CrashLoopBackOffs, broken/zombie services not being restarted or even entire services not becoming healthy after a scheduled restart.
This article presents how trivago's search backend team used reactive programming in Java effectively when designing and implementing one of our many Java backend services. Compared to traditional imperative and functional programming, reactive programming requires a mindset-shift in order to apply the concepts and techniques effectively. The benefits we gain support us in some key challenges that every engineer is facing with essentially every (micro-) service in today’s backend architectures: handling of blocking IO, backpressure, managing highly varying loads as well as message and error propagation.
In the trivago backend, we use the reactive programming pattern for fetching prices from advertisers and updating our caches. This helps us to increase the responsiveness (i.e., scalability and resilience) of our backend. Thus, our backend system can alleviate high response times from internal components and our advertisers while staying responsive, even if downstream components fail entirely. Here is how we use the Java library Reactor Core to ensure those guarantees:
Metrics are one of the main building blocks in the topic of observability.
Hence, we have a lot of metrics within our applications and especially for the connections between our applications. Every outgoing request has its latency measured and we also record the sizes of the request and the response. These numbers are collected in histograms and based on that data, in our Grafana graphs, we create corresponding graphs that show us e.g. the median size of request- and response payloads or the 99th percentile of call durations.
In our new series, trivago Tech Check-in, we're introducing you to some of our tech talents from across the globe who help keep our metasearch engine running smoothly everyday. In this first edition, you'll meet Fabian Fritzsche, an engineering intern that works on the Microservice-System that feeds our GraphQL API with up-to-date hotel data.
At trivago we operate a hybrid infrastructure of both on-premise machines and clusters on Google Cloud. Over time, we came up with a set of deployment guidelines for running our workloads as more and more of them are migrating to Google Cloud. These are not strict rules, but rather suggestions to best serve each team's needs.
The price of reliability is the pursuit of the utmost simplicity.
— C.A.R. Hoare, Turing Award lecture
Have you ever enthusiastically released a new, delightful version to production and then suddenly started hearing a concerning number of notification sounds? Gets your heart beating right? After all, you didn't really expect this to happen because it worked in the development environment.
On a normal day, we ingest a lot of data into our ELK clusters (~6TB across all of our data centers). This is mostly operational data (logs) from different components in our infrastructure. This data ranges from purely technical info (logs from our services) to data about which pages our users are loading (intersection between business and technical data).