Latest articles

  • Read Marketing Attribution: Evaluating The Path to Purchase in the Product Ecosystem

    Marketing Attribution: Evaluating The Path to Purchase in the Product Ecosystem

    While working with data and analyzing the interactions of our users with the products we have today, it is essential to understand their behaviors by tracking their past actions, such as opening notifications, interacting with a blog, or creating a new login in the platform. In that context, the attribution study refers to the method of grouping together all of those actions in a specific pattern to generate one desired end result.

  • Read Explore-exploit dilemma in Ranking model

    Explore-exploit dilemma in Ranking model

    Imagine, out of thousands of accommodations that match a user search, you have to select the “best” 25 to show to the user. Which ones would you show- the ones you know perform well or ones that have never been shown before, so that you can discover new high-potential accommodations? In the Data Science world, this is known as exploitation (continue doing what works well) versus exploration (try something new to discover hidden potential) problem and is often explained using the well-known multi-armed bandit problem. The objective of the problem is to divide a fixed number of resources between competing choices to maximize their expected gains, given that the properties of each choice are not fully known at the time of allocation.

  • Read How to substantially slow down your Node.js server

    How to substantially slow down your Node.js server

    Back in March 2022, after spending a considerable amount of effort migrating our monolithic Node.js GraphQL server from Express to Fastify, we noticed absolutely no performance improvements in production. That hit us like a bombshell, especially because Fastify performed exceptionally well in our k6 load tests in staging, where it responded to HTTP requests 107% (more than two times) faster on average than Express!

  • Read Powering ML-Based Systems With Reliable Data: The Data Annotation Journey

    Powering ML-Based Systems With Reliable Data: The Data Annotation Journey

    In the last few years, organisations have been increasing their investments in building Machine Learning (ML) based systems. In practice, such systems often took longer than expected to be built or failed to deliver the promised outcome. Data availability and quality have been among the most significant reasons behind this phenomenon. Since each organisation had its custom problems, open datasets or even logged data were not always directly usable. As a result, data collection and annotation processes became more crucial, yet remained under-documented.

  • Read How we scaled our Prometheus setup

    How we scaled our Prometheus setup

    In 2020 we started to migrate one of our most significant workloads, our Node.js based GraphQL API and many of its microservices, from our datacenter to Google Kubernetes Engine. We deploy it in three GCP regions, each having its Kubernetes cluster. Since then, our monitoring infrastructure has changed due to various periods of instability and pandemic induced scaling challenges.

  • Read How to Survive a Regional Outage

    How to Survive a Regional Outage

    As I’m writing this, we’re in the middle of our yearly load testing process.
    Since a couple of years now, trivago conducts regular production load tests. We do this to test if all our services sustain the increased load we experience during the summer and winter months.
    This year is now the 2nd time, where we also do another test: A "regional failover" test.

  • Read SRE: On-Call Procedure at trivago

    SRE: On-Call Procedure at trivago

    One of the many responsibilities of a Site Reliability Engineer (SRE), is to ensure uptime, availability and in some cases, consistency of the product. In this context, the product refers to the website, APIs, microservices, and servers. This responsibility of keeping the product up and running becomes particularly interesting if the product is used around the world 24 hours every day like trivago. And just like in the medical profession, someone has to be on call to react on failures and outages outside of the office hours.

  • Read How we got on top of our data

    How we got on top of our data

    Scalability and availability are key aspects of cloud native computing. If your microservice takes five minutes to start up, it becomes very difficult to meet the expectations because adjustments to traffic changes, regional failovers, hot-fixes and rollbacks are simply too slow. In this article, we show how we solved this and a few other problems by taking control of the process of updating our data and storing it in a highly available Redis setup.

Open Source Projects