Published Papers

Below is a list of the published papers I have been involved in.

2024 :)

ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation


Reproducibility is a cornerstone of scientific research, ensuring the reliability and generalisability of findings. The ReproNLP Shared Task on Reproducibility of Evaluations in NLP aims to assess the reproducibility of human evaluation studies. This paper presents a reproduction study of the human evaluation experiment in “Hierarchical Sketch Induction for Paraphrase Generation” by Hosking et al. (2022). The original study employed a human evaluation on Amazon Mechanical Turk, assessing the quality of paraphrases generated by their proposed model using three criteria: meaning preservation, fluency, and dissimilarity. In our reproduction study, we focus on the meaning preservation criterion and utilise the Prolific platform for participant recruitment, following the ReproNLP challenge’s common approach to reproduction. We discuss the methodology, results, and implications of our reproduction study, comparing them to the original findings. Our findings contribute to the understanding of reproducibility in NLP research and highlight the potential impact of platform changes and evaluation criteria on the reproducibility of human evaluation studies.


Unveiling NLG Human-Evaluation Reproducibility: Lessons Learned and Key Insights from Participating in the ReproNLP Challenge

  • Publication: ACL Anthology
  • Date: 7th Sept 2023
  • Conference: The 3rd Workshop on Human Evaluation of NLP Systems (HumEval’23) at RANLP 2023
  • Authors: Lewis Watson and Dimitra Gkatzia
  • Institution: School of Computing, Engineering, and the Built Environment. Edinburgh Napier University.


Human evaluation is crucial for NLG systems as it provides a reliable assessment of the quality, effectiveness, and utility of generated language outputs. However, concerns about the reproducibility of such evaluations have emerged, casting doubt on the reliability and generalisability of reported results. In this paper, we present the findings of a reproducibility study on a data-to-text system, conducted under two conditions: (1) replicating the original setup as closely as possible, with evaluators recruited from AMT; and (2) replicating the original human evaluation, but this time utilising evaluators with a background in academia. Our experiments show that there is a loss of statistical significance between the original and reproduction studies, i.e. the human evaluation results are not reproducible. In addition, we found that employing local participants led to more robust results. Finally, we discuss lessons learned, addressing the challenges and best practices for ensuring reproducibility in NLG human evaluations.

Backup Links: Backup 1 (Workshop) or Backup 2 (Self)

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

  • Publication: arXiv or PDF
  • Date: 7th Aug 2023


We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

Honours Dissertation: Investigating Reproducibility of Human Evaluations In Natural Language Generation

Note: This dissertation is uploaded as submitted. Since submission, the paper and some of its results have been corrected; please see the latest versions of the papers above.

  • Link: PDF
  • Author: Lewis Watson
  • Degree: BSc Computer Science (Hons) @ Edinburgh Napier University
  • Date: 19th April 2023


Reproducibility is a cornerstone of scientific research, and its relevance in evaluating Natural Language Generation (NLG) systems has been increasingly recognised. In NLG, human evaluations in which crowd-sourced annotators score the quality of generated text are frequently applied. However, due to a number of factors, including study design and reporting details, choice of metrics, and the variance of human annotators, the reproducibility of these evaluations is frequently questioned. The purpose of this study is to examine the reproducibility of human evaluations of NLG systems and to pinpoint the causes of irreproducibility. To do this, we review the relevant literature and reproduce a human evaluation experiment from a previous study, comparing the outcomes to the reported findings in the original study. The Human Evaluation Data Sheet (HEDS) is utilised to enhance transparency and facilitate reproducibility. The project aims to provide valuable insights into making human evaluations of NLG systems more reproducible and to propose potential solutions to mitigate irreproducibility in future studies.

Stay tuned for more publications soon. In the meantime, check out my blog posts here.