Why and how we use primitive maps

Intro

At trivago we operate on petabytes of data. In live-traffic applications that are related to the bidding business cases we use our in-house in-memory key-value storage-service written in Java to keep data as close to calculation logic as possible.

Having such data stores in public clouds could be quite costly in terms of money and performance, especially if you use applications written in languages with garbage collection mechanics, such as Java. But is there any way how to optimize the performance of such storage-services?

Let’s consider a few use-cases and optimization techniques and show why we use a set of concurrent wrappers around primitive maps.

Bits packaging

Imagine there is a simple Java-record:

public record Bid (int sourceId,
                   int groupId,
                   int bidA,
                   int bidB,
                   int bidC,
                   int campaignId) {
}

How many bits are needed to store a single record of Bid-object? Let’s calculate. There are 6 integer fields, 32 bits each and 128 bits (16 bytes in 64 bit JVM builds) for Java-object header. In total, we have at least 6*32 + 128 = 320 bits.

Note: Project Lilliput proposes to reduce the Java object header by half or more.

But what if it’s definitely known that each field of Bid-object has a certain cardinality? For example, the maximum value of bidA, bidB, bidC is not greater than 4096 (2^12), groupId and campaignId both have cardinality 64 (2^6) and sourceId — cardinality 16384 (2^14). That means, we could consider that our Bid-record uses only 12 _ 3 + 6 _ 2 + 14 = 62 bits as the payload.

Considering we store not just a few, but billions of such records, let’s compare the memory requirements of each storage scheme:

Number of records	Approximate amount (Bid-record)	Approximate amount (payload bits)
1	~ 320 bits	~ 62 bits
1_000	~ 40 Kb	~ 8 Kb
1_000_000_000	~ 37.2 Gb	~ 7.2 Gb

It shows that, by having a big data set, we could optimize our memory a lot (~ 80%). But at first we have to pack the bits into some lighter (in terms of memory) data structure.

As you see above, the payload requires only 62 bits, that could be fit in a long (int64) primitive:

table

Now, let’s try to encode a Bid to a long primitive. The idea is pretty simple: Step by step, we encode every Bid-field into bits, apply int64 bit mask and shift the new value using its own cardinality. For encoding we use a generic BitmaskEncoder:

public class BitmaskEncoder {

    private static final long[] MASK_64 = buildMask64();

    private final int[] fieldLengths;

    public BitmaskEncoder(int... fieldLengths) {
        this.fieldLengths = fieldLengths;
    }
    public long encode64(int... fields) {

        long encodedValue = 0;

        int shiftStep = 0;

        int fieldIndex = 0;

        do {
            // take a field
            int nextField = fields[fieldIndex];
            // take the appropriate field length
            int fieldLength = this.fieldLengths[fieldIndex];
            // take the appropriate mask
            long fieldMask = MASK_64[fieldLength];

            // applying the mask to the field,
            // shifting it and adding to the final value
            encodedValue = encodedValue | ((nextField & fieldMask) << shiftStep);

            shiftStep = shiftStep + fieldLength;

            fieldIndex++;
        } while (fieldIndex < fields.length);

        return encodedValue;
    }

    private static long[] buildMask64() {
        int maxBits = 64;

        long[] masks = new long[maxBits];

        for (int i = 0; i < maxBits - 1; i++) {
            masks[i] = (long) Math.pow(2, i) - 1;
        }

        masks[maxBits - 1] = (long) Math.pow(2, maxBits);

        return masks;
    }
}

Here’s an example of how it could work for our particular case:

int numFields = 6;

int[] fieldLengths = new int[numFields];
fieldLengths[0] = 14; // sourceId
fieldLengths[1] = 6; // groupId
fieldLengths[2] = 12; // bidA
fieldLengths[3] = 12; // bidB
fieldLengths[4] = 12; // bidC
fieldLengths[5] = 6; // campaignId

var encoder = new BitmaskEncoder(fieldLengths);

Finally, let’s see how we can encode some record into long value:

var bid = new Bid(10, 4, 1, 2, 3, 8);
long encodedValue = encoder.encode64(
        bid.sourceId(),
        bid.groupId(),
        bid.bidA(),
        bid.bidB(),
        bid.bidC(),
        bid.campaignId());

What we use for storing primitives

Now we have our bit-packaging algorithm and we want to be able to store our records in some collection.

We could consider standard ConcurrentHashMap<K,V> , but it requires storing key and value both as reference types. This means that, even if we pack our bits into a long value, we would have to store this value in a Long wrapper. This brings with it an additional overhead of at least 128 bits for the Java object header, which we tried so hard to eliminate.

Fortunately, there are several open-source libraries, that offer storing keys and values which allow storing keys and values as primitive values, without boxing:

In production, we use fastutil, because, based on our internal benchmarks, it shows better performance for read and write operations.

The original fastutil library supports the concurrent versions of maps based on java.util wrappers, which has the following definition:

private static class SynchronizedMap<K,V> implements Map<K,V>, Serializable {
...
    @SuppressWarnings("serial") // Conditionally serializable
    private final Map<K,V> m;   // Backing Map
    @SuppressWarnings("serial") // Conditionally serializable
    final Object mutex;         // Object on which to synchronize

    SynchronizedMap(Map<K,V> m) {
        this.m = Objects.requireNonNull(m);
        mutex = this;
    }
...
    public V get(Object key) {
        synchronized (mutex) {return m.get(key);}
    }

    public V put(K key, V value) {
        synchronized (mutex) {return m.put(key, value);}
    }
...

You can see that there is a single mutex for the whole map. Such an approach could be very costly in terms of latency of calculation. That’s why we implemented and open sourced our own concurrent wrapper over fastutil maps — Fastutil Concurrent Wrapper.

It works with striped ReentrantReadWriteLock, which gives better performance on reads and holds the lock only per one bucket. There are two different locks mode:

standard striped Lock,
striped lock with busy-waiting acquiring strategy.

In situations where you need to have the low latency processing, the second mode, BusyWaitingMode, can provide better throughput and average time per operation, avoiding calling LockSupport.park().

Let’s look at a part of the BusyWaitingMode implementation:

public class ConcurrentBusyWaitingLongLongMap extends PrimitiveConcurrentMap implements LongLongMap {

    private final LongLongMap[] maps;
    private final long defaultValue;
...
    @Override
    public long get(long key) {
        int bucket = getBucket(key);

        Lock readLock = locks[bucket].readLock();

        while (true) {
            if (readLock.tryLock()) {
                try {
                    return maps[bucket].get(key);
                } finally {
                    readLock.unlock();
                }
            }
        }
    }

    @Override
    public long put(long key, long value) {
        int bucket = getBucket(key);

        Lock writeLock = locks[bucket].writeLock();

        while (true) {
            if (writeLock.tryLock()) {
                try {
                    return maps[bucket].put(key, value);
                } finally {
                    writeLock.unlock();
                }
            }
        }
    }
...

Note: right now, Fastutil Concurrent Wrapper does not support the concurrent version for every primitive map from original fastutil library. Contributions are welcome!

Benchmarks

Now, let’s consider performance comparison between primitives maps and standard Java maps.

For benchmarking, we compared 3 instances with 3 different storage variants each:

Bit-packaged variant with primitive concurrent fastutil wrapper long-long map (optimized),
Bit-packaged variant with ConcurrentHashMap<Long,Long> (optimized-ref),
A simple standard Java ConcurrentHashMap<Long, Bid> map (not-optimized).

Notes:

before the benchmarks were run, every instance was deployed and contained ~150 million <long, long> records and ~1 GB of an additional data. For the additional data there were no writes/reads.
we ran our instances on Java 16 (openjdk) and with -XX:AutoBoxCacheMax=2700, because there are a lot of relatively small integers.

Writes

Benchmarks were done during one of our heavily-writes period, with ~ 300k max writes/sec for every considered instance:

events

During the writes period, incoming events were writing new key-value pairs and overwriting existing pairs.

Memory

memory

Consider the memory occupation in Old Gen we can see, that our optimized maps occupy only 6.6 GiB max, while other variants with ConcurrentHashMap 13.3 GiB and 16.9 GiB respectively.

The reason for such a big difference is having keys and values as primitives, without any Java-object header overhead and avoiding Java-records.

CPU and GC

Now we check what overhead both Java-object header and structures give to CPU and GC usage:

cpu

The picture shows how much CPU was burnt for allocating a new piece memory for records as well as GC-work. Interestingly, that not-optimized version has used less CPU compare to optimized-ref. The explanation is: for packaging, the service had to burn an additional CPU, when not-optimized just allocated a piece of memory for records and put it into the map. But most important is: our optimized primitive maps version has burnt much less CPU: ~4.5% vs ~28% vs ~61%.

Equally, interesting results for GC (young generation): the Garbage Collector of the optimized version ran for ~400 microseconds, while the other versions ran ~55-60 milliseconds (even with -XX:AutoBoxCacheMax). The difference arises mainly because of large allocations during write operations.

Conclusion

There are a lot of different ways of optimizing your live-traffic application, which could save money and latency. We have considered options for optimizing stored values with certain cardinalities. You could use bits-packaging and Fastutil Concurrent Wrapper, as well as some JVM flags like -XX:AutoBoxCacheMax.

Nevertheless, trivago offers many more interesting challenges to be solved. Check out our open positions if you’re keen to come solve puzzles with us.

Why and how we use primitive maps

Intro

Bits packaging

What we use for storing primitives

Benchmarks

Writes

Memory

CPU and GC

Conclusion

Author Info

Article tags

Related Articles

Accelerating experimentations through Simulations

Building Our First GraphQL Server with Go: An Implementation Guide

Technical Decision-Making

How to substantially slow down your Node.js server

Intro

Bits packaging

What we use for storing primitives

Benchmarks

Writes

Memory

CPU and GC

Conclusion

trivago tech newsletter

Author Info

Article tags

Related Articles

Career? trivago.