My 2 Cents: I'll gladly spend them to stop staring at test logs

The problem

Is 99.8% a good success rate for end-to-end test automation of a web application? What about 99.9%? Maybe, but with a test suite of up to 1,000 scenarios it often means one or two failing tests per run, because of the well-known problem of test flakiness. It can be more than that when the environment has a bad day.

Our preview and stage systems have high availability, but they are obviously not as reliable and performant as production systems; many services, for example, respond with increased latency. So when you trigger your end-to-end suite and three tests fail, you have to figure out: Is this my fault? Was it already broken? Is it flaky? We have good reporting and monitoring in place, but it still takes time and cognitive load to scan Kibana dashboards, dig through daily flakiness reports in Slack, and cross-reference branch histories manually. I would definitely spend 2 cents to outsource the assessment of such failures.

Meet the test suite

Our end-to-end test suite in the Continuous Integration pipeline of our main product consists of two main layers. The first is a small and hardened set of core scenarios, “trupi-core” tests. Trupi is the name of our internally developed, Selenium-based framework. The core test suite runs on every single commit and usually completes in just a couple of minutes. Beyond that, we have a much larger suite called “trupi-extended”, invoked on demand via pull request comments in GitHub. This second suite runs across three sub-configurations: desktop browsers, mobile browser emulation, and a server-side rendering variant that tests SEO-critical paths with JavaScript disabled.

The extended suite covers a wide range of functionality and faces an assorted set of challenges: inherent test flakiness, live data updates from underlying services, transient microservice latency in preview environments, and constant churn from feature variants being accepted and becoming the new default. The test success rate stays stable above 99% if you measure at the individual scenario level, but at the workflow run level it drops to a grim 20%. In other words, four out of five times someone triggers the extended suite, at least one failure needs to be investigated. With at least 40 runs a day, that adds up to more than 30 failure investigations every day. That is the frequency of the problem we set out to mitigate.
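The gap between the two success rates follows from simple compounding. As a rough sketch, assuming every scenario passes independently and with the same probability (a simplification that real suites do not strictly satisfy), the chance of a fully green run is the per-scenario pass rate raised to the number of scenarios. The helper name run_pass_rate is hypothetical.

```shell
# Hypothetical helper: probability that a whole run is green, assuming
# every scenario passes independently with the same probability.
# $1 = per-scenario pass rate (0..1), $2 = number of scenarios
run_pass_rate() {
    LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", p ^ n }'
}

run_pass_rate 0.998 1000   # prints 0.135
run_pass_rate 0.999 1000   # prints 0.368
```

Under the same assumptions, a 20% green-run rate on a suite of about 1,000 scenarios corresponds to a per-scenario pass rate of roughly 99.84%, which is consistent with the "above 99%" figure.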

Figure: Kibana test results dashboard

An AI-CI-detective: how it works

With the recent advancements in the quality of code-related AI output, we saw a way to improve the experience of everyone involved in writing or running tests. For our QA engineers, having good context about the state of the tests is not a challenge, but opening dashboards or re-running failures is a distraction that interrupts the work currently in focus. For our developers, keeping up with the constant updates and challenges of test maintenance can be a bottleneck as well.

The implementation of our AI-powered solution lives entirely inside the existing “trupi-extended” GitHub Actions workflow, as a separate trailing job called trupi-failure-analysis. The separation is intentional: the test execution jobs run in parallel across platforms, as they already did, and have their results collected. Once all of them have completed, the analysis job decides whether it is worth running at all. That decision is encoded directly in the job condition. If all suites/platforms pass, no analysis runs. If more than 10 tests fail on one platform, no analysis runs either, as a mass failure almost always points to a shared infrastructure issue, which is self-evident and doesn’t need an LLM to explain. This way we avoid a context window so wide that the model loses focus on the failure signals that matter, while also keeping per-run costs within a predictable range.
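The gating rule can be sketched in a few lines of shell. In the real workflow the decision sits in the job condition, which is not shown here, so the function name should_run_analysis and the way failure counts are passed in are assumptions made for illustration.

```shell
# Hypothetical helper: decide whether the analysis job is worth running.
# Arguments: one failure count per platform (e.g. desktop mobile ssr-mobile).
MAX_FAILURES_PER_PLATFORM=10

should_run_analysis() {
    total=0
    for count in "$@"; do
        # A mass failure on any platform points to infrastructure: skip.
        if [ "$count" -gt "$MAX_FAILURES_PER_PLATFORM" ]; then
            return 1
        fi
        total=$((total + count))
    done
    # Run only when at least one scenario failed.
    [ "$total" -gt 0 ]
}

if should_run_analysis 2 1 0; then
    echo "run analysis"
else
    echo "skip analysis"
fi
```

Keeping the threshold in a single variable makes it easy to tune once there is enough data on how large a failure set the model still handles well.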

Feeding AI only what it needs: the run2.json artifact

Each test platform produces a structured failure report, generated by a new dedicated plugin added to our trupi framework. The format is intentionally minimal, so that only the fields that genuinely matter for failure analysis are included:

{
  "runId": "run2",
  "totalScenarios": 1,
  "passed": 0,
  "failed": 1,
  "failures": [
    {
      "scenario": "Image filters in gallery",
      "id": "b8ae62fc-f6da-4bd1-8d67-f1a460f71fdf",
      "tags": [
        "@extended",
        "@test"
      ],
      "totalSteps": 8,
      "completedSteps": 5,
      "durationMs": 52078,
      "lastUrl": "https://trunk.trivago.com/en-GB/srl?search\u003d100-2615537%3Bdr-20260329-20260330%3Bdrs-40%3Brc-1-2#overlay-gallery",
      "errorType": "AssertionFailedError",
      "errorMessage": "Expecting actual:",
      "stackTrace": [
        "at glue.GallerySteps.iSeeAllPhotosFilterIsSelectedAndOrderOfTheFiltersIsMaintained(GallerySteps.java:252)"
      ],
      "selector": null,
      "steps": [
        {
          "index": 4,
          "keyword": "Then",
          "text": "I see image filters are shown on top on the gallery",
          "status": "PASSED",
          "note": "last passing step"
        },
        {
          "index": 5,
          "keyword": "And",
          "text": "I see \u0027All photos\u0027 filter is selected and order of the filters is maintained",
          "status": "FAILED",
          "note": null
        }
      ]
    }
  ]
}

Importantly, this is the report from the second test run: the workflow automatically reruns all failing scenarios once. If a scenario passes on the second attempt, it is excluded from the AI report. Only persistent failures make it through, which means the AI is reasoning about a significantly cleaner signal from the start.

The three platform reports (desktop, mobile, ssr-mobile) are downloaded as GitHub Actions artifacts and merged into a single unified context file. As part of that merge, a cross-platform correlation is computed with a simple jq expression. Scenarios that appear in both desktop and mobile failure lists are flagged, since the same test failing on two independent browser setups almost always points to a backend or shared-code issue rather than a UI regression.
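The exact jq expression is not shown in this post. One way such a correlation could look, assuming each platform report follows the format above and using a hypothetical helper name, is an intersection of the two scenario lists:

```shell
# Hypothetical helper: print scenarios that failed on both platforms.
# $1 and $2 are per-platform failure reports in the format shown earlier.
# The intersection is computed as left - (left - right).
cross_platform_failures() {
    jq -r -n --slurpfile a "$1" --slurpfile b "$2" '
        ($a[0].failures | map(.scenario)) as $left
        | ($b[0].failures | map(.scenario)) as $right
        | ($left - ($left - $right))
        | unique
        | .[]
    '
}

# Usage (file names are assumptions):
#   cross_platform_failures desktop/run2.json mobile/run2.json
```

The sketch matches on the scenario name; matching on the id field instead would work the same way if ids are stable across platforms, which the post does not state.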

Building context in layers

Once the failure data is assembled, the job collects the PR change context in three layers of increasing granularity:

Commit log (git log --oneline): a concise list of what changed. No AI is needed to enumerate commits, but the LLM benefits from knowing the intent behind each one. Truncated at 10KB.
File-level diff stats (git diff --stat): which files changed and by how much, useful for spotting whether a heavily modified component overlaps with a failing test area. Truncated at 10KB.
Filtered detailed diff: the actual code changes, scoped to the directories that matter (components/, pages/, services/ and so on) and to the test feature files themselves. Everything else (build configs, CI workflows, documentation) is excluded. Truncated at 24KB.

Alongside these, the job includes the per-platform test_history.json files, populated from our Elasticsearch index via a query scoped to the last 24 hours across all branches except the current one. The query response provides recent pass/fail rates for each failing scenario. If a scenario has failed, say, seven out of its last ten runs on other branches, that is the strongest signal the AI could possibly receive, and it takes no complex reasoning to interpret. The detailed failure message of each scenario is also available, among other fields, so the analysis is not limited to looking at numbers.
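The three diff layers above could be collected along these lines. This is a minimal sketch, not the workflow's actual script: the base ref argument, the output file names other than diff_filtered.txt, the '*.feature' pathspec, and reading 10KB as 10240 bytes are all assumptions.

```shell
# Hypothetical helper: keep at most $1 bytes of stdin.
truncate_bytes() {
    head -c "$1"
}

# Hypothetical helper: write the three change-context layers to a directory.
# $1 = base ref to compare against (assumed, e.g. origin/main)
# $2 = output directory (assumed, e.g. context)
collect_pr_context() {
    base="$1"
    out="$2"
    mkdir -p "$out"
    # Layer 1: what changed, one line per commit.
    git log --oneline "$base..HEAD" | truncate_bytes 10240 > "$out/commits.txt"
    # Layer 2: which files changed and by how much.
    git diff --stat "$base...HEAD" | truncate_bytes 10240 > "$out/diff_stat.txt"
    # Layer 3: detailed diff, limited to relevant directories and feature files.
    git diff "$base...HEAD" -- components/ pages/ services/ '*.feature' \
        | truncate_bytes 24576 > "$out/diff_filtered.txt"
}

# Usage: collect_pr_context origin/main context
```

Truncating by bytes can cut a diff hunk in half, so a real implementation might prefer to truncate at line boundaries.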

Keeping the context window compact

Before invoking the LLM, the job estimates the total prompt size and applies a hard token guard:

PROMPT_CHARS=$(wc -c < "$PROMPT_FILE" | tr -d ' ')
EST_TOKENS=$((PROMPT_CHARS / 4))

if [[ "$EST_TOKENS" -gt 50000 ]]; then
    echo "WARNING: Prompt exceeds 50K tokens — trimming diff context and regenerating prompt"
    echo "(Detailed diff omitted due to size — see diff stat above)" > context/diff_filtered.txt
    DIFF_FILTERED=$(cat context/diff_filtered.txt)

    # a detailed prompt follows...
fi

This is a simple but effective safety net: if the context window would become too large, typically when a PR has an unusually large diff, the detailed diff is dropped and replaced with a placeholder, while the commit log, file stats, and test history are preserved. Accuracy degrades slightly, but the job still runs within budget and produces a useful output.

The actual AI call

With context assembled and validated, the Cursor CLI is installed on the runner and invoked in non-interactive mode:

"$CURSOR_BIN" \
  -p "$(cat "$PROMPT_FILE")" \
  --force \
  --mode ask \
  --model "gemini-3-flash" \
  --output-format text > "$ANALYSIS_FILE"

The --mode ask flag is key here, as it keeps the agent in a stateless, single-turn mode. No file exploration, no tool calls, just reasoning over the provided context. This is precisely what we want in a CI environment: predictable, fast, read-only. The prompt instructs the model to follow a strict order of priority when assigning root causes: flaky test history first, then infrastructure signals, and PR-introduced regression last. This counters the common LLM tendency to assume that the most recent change is always responsible for a failure. Our instructions are along the lines of: flaky until proven guilty.
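To make the priority order concrete, here is how such a rule set might be written into the prompt file. The actual prompt is not shown in this post; the wording and the function name write_priority_rules are invented purely for illustration.

```shell
# Hypothetical prompt fragment; the wording is illustrative only and is not
# the prompt used in the real workflow.
write_priority_rules() {
    cat <<'EOF'
When assigning a root cause, check these explanations in order and stop at
the first one the evidence supports:
1. FLAKY: the scenario also failed recently on other branches (see test history).
2. INFRASTRUCTURE: timeouts, latency, or the same failure on several platforms.
3. REGRESSION: only if the diff touches code related to the failing step.
Do not blame the pull request just because it is the most recent change.
EOF
}

# Usage: write_priority_rules >> "$PROMPT_FILE"
```

Asking the model to stop at the first supported explanation is one way to encode "flaky until proven guilty"; other phrasings may work equally well.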

The resulting Markdown analysis is written directly to the GitHub Actions Step Summary. A hint in the outcome comment on the pull request (the original trigger comment, later updated by the workflow) points developers to it: “🪄 AI failure analysis will appear in the workflow summary within 2–3 minutes.”
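Publishing to the Step Summary only requires appending Markdown to the file that GitHub Actions exposes through the GITHUB_STEP_SUMMARY environment variable. A minimal sketch, where the function name and the fallback message for an empty result are assumptions:

```shell
# Append the analysis to the job's Step Summary.
# GITHUB_STEP_SUMMARY is set by GitHub Actions to a per-step Markdown file.
publish_analysis() {
    analysis_file="$1"
    if [ -s "$analysis_file" ]; then
        cat "$analysis_file" >> "$GITHUB_STEP_SUMMARY"
    else
        # Assumed fallback so an empty model response is still visible.
        echo "AI failure analysis produced no output." >> "$GITHUB_STEP_SUMMARY"
    fi
}

# Usage: publish_analysis "$ANALYSIS_FILE"
```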

Figure: GitHub test results comment

90 seconds to insights

It took multiple iterations to optimize the execution time and the quality of the output while keeping costs within a controlled range. We currently rely on Google’s gemini-3-flash model, which gives us a good job runtime (30 seconds to 3 minutes, in most cases under 1 minute) and an acceptable cost, usually no more than $0.02 to $0.03 per run. And here are my 2 cents: spent on significantly reducing the time dedicated to failure analysis and all the context switching caused by flaky tests that require comparison with other test runs. Considering the dozens of workflow runs that happen during a typical working day, the cost of introducing this improvement across the whole repository is about as low as a cup of coffee per day. An affordable Italian espresso, preferably.
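The coffee comparison can be checked with the figures mentioned earlier in this post: about 40 runs a day, of which roughly 80% have at least one failure. The sketch below assumes every failing run triggers an analysis, which slightly overestimates the cost because mass failures are skipped; the helper name is hypothetical.

```shell
# Hypothetical helper: estimated daily spend on AI failure analysis.
# $1 = workflow runs per day, $2 = share of runs with failures (0..1),
# $3 = cost per analysis in dollars
daily_analysis_cost() {
    LC_ALL=C awk -v runs="$1" -v share="$2" -v cost="$3" \
        'BEGIN { printf "%.2f\n", runs * share * cost }'
}

daily_analysis_cost 40 0.8 0.02   # prints 0.64
daily_analysis_cost 40 0.8 0.03   # prints 0.96
```

That puts the estimate at somewhere between $0.64 and $0.96 per day under these assumptions.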

When a test is flaky across many branches, the message states it clearly and also mentions how often it passed or failed. When the LLM instead suspects a regression because of a reasonable correlation with a change in the pull request, it can hint at which change might be the cause. Last but not least, when the failing test just needs a locator update, for example because it relies on a fragile XPath expression that is no longer valid, the LLM is also capable of suggesting the required update in detail.

Figure: Test failure analysis example

Results and development impact

We also saw a real qualitative impact across the teams: our developers, who are not always invested in the end-to-end layer of test automation, now have a quick way to understand whether they are “guilty” of a test failure or not. We moved from “Why always me?” and rolling eyes to higher confidence in evaluating test failures, and hence to shorter times to ship features to production. Fewer blind reruns of tests that failed due to environment issues also mean some savings on CI bills, which might counterbalance the limited costs of the new feature.

What’s next

The AI analysis of failures now empowers us to make faster decisions about test results, but we also see potential for additional features. For example, the AI could decide to rerun some of the failed tests or directly raise a separate pull request to fix flaky tests. Even easier would be to post a concise action point back to the pull request as a comment, but would that implicitly lead people to skip reading even the few lines of the full summary? Sometimes solutions that are technically quick to implement have unaccounted effects on process and mindset that have to be analyzed carefully. We still want conscious decisions and “AI in the loop”, rather than “Human in the loop”. A training manual by IBM, back in 1979, stated: “A computer can never be held accountable, therefore a computer must never make a management decision”. Indeed, a firm objective is to keep our QA engineers and developers in control, but with enhanced situational awareness and unwavering focus on what really makes a difference.