Improving Evaluation Practices in Natural Language Generation


Throughout the last year I had the opportunity to participate in and collaborate on multiple research initiatives in the field of Natural Language Generation (NLG), in addition to my responsibilities as a Data Scientist at trivago. NLG is the process of automatically generating text from textual and/or non-linguistic data inputs. NLG applications include chatbots, image captioning, and report generation. These application areas are of high interest within trivago, as we seek to leverage our rich data environment to enrich the user experience.

In the past we contributed to the area of NLG by creating a system called “Hotel Scribe”, which automatically generates descriptions of hotel accommodations from the structured data we hold internally at trivago. The system showed significant promise both in internal evaluations and when compared with accommodation descriptions from an external source. We published the results in our 2019 paper, which we presented at the International Natural Language Generation Conference (INLG) 2019 in Tokyo, Japan. We have also collaborated with other researchers on an extensive review of past human evaluations, which uncovered significant issues in how they are conducted within the NLG research community. These findings were published in our INLG 2020 research paper.

Evaluation Research

The main theme of our research throughout 2021 was exploring diverse issues around evaluation. This topic has been receiving increasing attention within the Natural Language Processing (NLP) and Natural Language Generation (NLG) research communities, especially as more and more shortcomings of current evaluation practices are identified: automated metrics correlate poorly with human judgements, human evaluations vary widely in approach, and the reproducibility of NLP/NLG evaluations remains an open question.

We examined human evaluations in our HumEval 2021 paper. In particular, by analysing previously published papers, we explored how human evaluations are used in the creation of NLG systems that incorporate some form of “common sense” knowledge. We found large variance in how such systems are evaluated, making it impossible to compare different systems or define baselines. As part of the paper we proposed the Commonsense Evaluation Card (CEC), a set of recommendations for reporting evaluations of such systems.

There are also significant issues with automatic evaluations. Metrics such as BLEU and ROUGE compute the overlap between a generated text and one or more reference texts, i.e. the proportion of n-grams they share. These metrics are widely used to evaluate large neural language models (e.g. GPT-3). Beyond the question of whether such automated metrics are appropriate at all, there is also the question of whether these models are evaluated on sufficiently diverse datasets to give a representative picture of their performance. We examined this in our NeurIPS 2021 paper, which explored going beyond summarising a model's performance with a single number for a particular criterion, e.g. accuracy. Instead, we suggested a new approach enabling more in-depth analyses, where researchers evaluate against multiple automatically generated challenge test sets. In our experiments we found significant differences between reported scores and those obtained on the challenge sets. For example, we observed differences in how the models perform on gender, language, and ethnicity sub-population datasets, which suggests that the models perform poorly when they encounter new concepts and words. These initial findings indicate a general need to test on broader data to understand how well a given model will actually perform.
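To make the overlap idea concrete, here is a minimal sketch of a ROUGE-1-style unigram recall score. Real implementations handle stemming, multiple references, and longer n-grams; the function name and whitespace tokenisation below are simplifying assumptions for illustration only.

```python
from collections import Counter

def rouge_1_recall(candidate: str, reference: str) -> float:
    """Unigram recall: the fraction of reference tokens that also appear
    in the candidate, with counts clipped so repeated tokens in the
    candidate are not rewarded more than once per reference occurrence."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[tok], n) for tok, n in ref.items())
    return overlap / max(sum(ref.values()), 1)

# Five of the six reference tokens appear in the candidate -> 5/6.
score = rouge_1_recall("the hotel has a large pool",
                       "the hotel offers a large pool")
```

A metric this shallow illustrates the core concern: a fluent paraphrase that shares few tokens with the reference scores poorly, while a disfluent text that copies reference tokens can score well.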

Our work for the NeurIPS paper was enabled by a new framework called NL-Augmenter, which generates controlled perturbations of a given test dataset. We have open-sourced NL-Augmenter on GitHub, and it has already become a substantial collaborative effort, with contributors adding many types of filters and transformations that produce a far wider variety of perturbations than those in our original paper.
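To give a flavour of what a perturbation looks like, the sketch below is a simplified, stand-alone illustration of one kind of character-level transformation (keyboard-typo noise). The function name, the tiny neighbour map, and the parameters are assumptions for illustration; they are not NL-Augmenter's actual API.

```python
import random

def keyboard_typos(text: str, prob: float = 0.1, seed: int = 0) -> str:
    """Perturb a text by swapping some characters for keyboard
    neighbours, simulating typos. The neighbour map is a tiny
    illustrative subset of a QWERTY layout."""
    neighbours = {"a": "qs", "e": "wr", "i": "uo", "o": "ip", "t": "ry"}
    rng = random.Random(seed)  # fixed seed keeps the perturbation reproducible
    return "".join(
        rng.choice(neighbours[ch]) if ch in neighbours and rng.random() < prob else ch
        for ch in text
    )
```

Running a model on both the clean and perturbed versions of a test set, and comparing the two scores, is the kind of in-depth analysis the challenge-set approach enables.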

Reproducibility was also an issue we explored, for the ReproGen 2021 shared task at INLG 2021. In our paper we set out to reproduce an experiment previously conducted with colleagues at trivago, which itself reproduced an early experiment from 2007 asking participants to compare generated texts containing hedged phrases (e.g. “unfortunately”, “sadly”) against texts containing none. Colleagues were recruited to read these texts in two settings and recorded their responses for each setting through a questionnaire. The reproducibility experiment showed that whilst it was straightforward to replicate the procedural aspects of the experiment, we could only partially replicate the original results. This demonstrated two things: firstly, the results of an earlier human evaluation may not generalise beyond the cohort that originally participated; secondly, there is a pressing need for reproduction studies within research generally.

The need for reproducibility is particularly important as outputs from academic work are increasingly utilised in industry settings. Within trivago, for example, we make applied use of Machine Learning components based on academic research in areas such as image tagging and item matching. Understanding how to competently evaluate the performance of these models and reproduce results gives us greater confidence in them when they are deployed to production and when we make changes to add new features or capabilities.

Finally, we explored the underreporting of errors in NLG systems. In our INLG 2021 paper we observe that authors significantly underreport errors: more than half of the papers we examined from previous conferences failed to mention any errors at all, and those that did mostly discussed them only superficially. This systemic underreporting of the weaknesses of existing approaches can give readers an inflated perception of their robustness. We concluded that the research community needs a more concerted effort to improve standards when reporting the outcomes of NLG experiments, and that more detailed error reporting would better inform the community at large about the strengths and weaknesses of a given approach.

Research Project Partnerships

In addition to the above research on evaluation, trivago supported as a project partner two research projects funded by the United Kingdom’s Engineering and Physical Sciences Research Council (EPSRC).

As a project partner, trivago will provide support in the form of advice, data, and participation in these funded research initiatives.


The main takeaway from the past year is that there are significant issues with evaluation practices in general, irrespective of whether they are human or automatic. Improving the standards by which NLP and NLG systems are evaluated, and the reproducibility of those evaluations, is imperative not just for academia but also for those of us working in industry. More robust evaluations will allow us to better understand the advantages and disadvantages of a given approach and to have more confidence in the results being presented. However, the recent focus on these questions gives me confidence that the research community will make significant progress on them in the coming years.

We're Hiring

Tackling hard problems is like going on an adventure. Solving a technical challenge feels like finding a hidden treasure. Want to go treasure hunting with us?