One of the many responsibilities of a Site Reliability Engineer (SRE), is to ensure uptime, availability and in some cases, consistency of the product. In this context, the product refers to the website, APIs, microservices, and servers. This responsibility of keeping the product up and running becomes particularly interesting if the product is used around the world 24 hours every day like trivago. And just like in the medical profession, someone has to be on call to react on failures and outages outside of the office hours.
At trivago we store a subset of our realtime metric data in InfluxDB and we are quite impressed by the load it can handle. Despite all the joy, we had to learn some lessons the hard way. It is pretty easy to overload the database or the web browser by executing queries that return too many datapoints. To prevent that, we wrote Protector - a circuit breaker for Time series databases that blocks malicious queries.
At trivago we love hotels above everything else, but we also like metrics, we love to measure everything, compare, decide, improve and then rinse and repeat. In this blog entry we are going to describe our experience with InfluxDB, a time series database that we are using to store some real time metrics.