Published Papers

Below, you will find a list of published papers I have been involved in.

2026 :)

Frame2KG: A Benchmark and Evaluation Toolkit for Interpretable Frame-to-Graph Generation

Publication: LREC 2026 In Proceedings or Self-Hosted PDF
Date: 11–16th May 2026
Conference: Proceedings of the 2026 International Conference on Language Resources and Evaluation LREC 2026 - Palma, Mallorca
Authors: Lewis Watson, Carl Strathearn, Kenny Mitchell, and Yanchao Yu
Institution: School of Computing, Engineering, and the Built Environment, Edinburgh Napier University

Abstract

This work focuses on interpretable frame-to-knowledge-graph (Frame2KG) generation for embodied robots, targeting on-device inference to enhance privacy, improve interpretability, and minimise compute costs.

We introduce Frame2KG-YC2, a synthetic and reproducible dataset, and fine-tune Qwen2.5-VL models using LoRA applied to attention layers (QKVO), with and without GateProjUp/Down projections. For benchmarking, we propose a deterministic evaluation toolkit featuring two-stage node matching (IoU gate followed by Hungarian assignment on text similarity) and comprehensive graph-level metrics.
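The two-stage node matching described above can be illustrated with a minimal sketch. This is not the released toolkit: the node format (`label`, `box`), the `iou_gate` threshold of 0.5, and the use of `difflib` for text similarity are all assumptions made for illustration. To stay dependency-free, the optimal one-to-one assignment is found by brute force over permutations, which gives the same result as the Hungarian algorithm on small inputs but does not scale.

```python
from difflib import SequenceMatcher
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def text_sim(a, b):
    """Hypothetical text similarity in [0, 1]; the paper's measure may differ."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_nodes(pred, gold, iou_gate=0.5):
    """Return sorted (pred_index, gold_index) pairs for matched nodes."""
    # Stage 1: IoU gate. Pairs whose boxes overlap too little are disallowed (None).
    score = [
        [
            text_sim(p["label"], g["label"]) if iou(p["box"], g["box"]) >= iou_gate else None
            for g in gold
        ]
        for p in pred
    ]
    # Stage 2: one-to-one assignment maximising total text similarity.
    # Brute force stands in for the Hungarian algorithm here.
    transpose = len(pred) > len(gold)
    rows, cols = (len(gold), len(pred)) if transpose else (len(pred), len(gold))

    def s(r, c):
        return score[c][r] if transpose else score[r][c]

    best_total, best_pairs = -1.0, []
    for perm in permutations(range(cols), rows):
        pairs = [(r, c) for r, c in enumerate(perm) if s(r, c) is not None]
        total = sum(s(r, c) for r, c in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return sorted((c, r) if transpose else (r, c) for r, c in best_pairs)


def f1(n_matched, n_pred, n_gold):
    """F1 from pooled counts (assumed here to be how a micro-averaged score is formed)."""
    precision = n_matched / n_pred if n_pred else 0.0
    recall = n_matched / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example with made-up nodes.
pred = [
    {"label": "knife", "box": (10, 10, 50, 50)},
    {"label": "cutting board", "box": (0, 40, 100, 100)},
    {"label": "onion", "box": (200, 200, 220, 220)},
]
gold = [
    {"label": "chopping board", "box": (0, 45, 100, 100)},
    {"label": "knife", "box": (12, 8, 52, 48)},
]

matches = match_nodes(pred, gold)
print(matches)  # [(0, 1), (1, 0)]
print(round(f1(len(matches), len(pred), len(gold)), 3))  # 0.8
```

Gating on IoU first means a predicted node can only be paired with a reference node that occupies roughly the same image region, so the text-similarity assignment never pairs objects with similar labels in different places.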

On a held-out test set, the best-performing model achieves Node F1μ = 0.621, Edge F1μ = 0.208, and a mean matched IoU of approximately 0.61, with over 98% schema conformity. Post-training quantisation maintains performance while improving suitability for edge deployment.

We release the dataset, code, adapters, and evaluation toolkit to establish an interpretable baseline for future temporal and multi-view extensions.

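Conversational AI Without the Cloud: A Lightweight, Local Dialogue Pipeline for Non-Commercial Social Robots

Publication: ACM Digital Library or Self-Hosted PDF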
Date: 16–19th March 2026
Conference: HRI 2026 - Edinburgh, Scotland, UK
Authors: Lewis Watson, Emilia Sobolewska, Carl Strathearn, Mayuko Morgan, and Yanchao Yu
Institution: School of Computing, Engineering, and the Built Environment, Edinburgh Napier University

Abstract

A major limitation of current social robots is their dependence on cloud-based dialogue pipelines, which restricts use in settings with limited or unreliable connectivity. We present a lightweight, fully local spoken-dialogue system that runs on consumer-grade hardware and integrates open-source models for speech recognition, dialogue generation, and text-to-speech. The pipeline was deployed on a non-commercial humanoid robot across several public engagement events, enabling extended real-world interaction without internet access. We analyse over 5,000 dialogue turns to characterise system behaviour, user interaction patterns, and challenges arising in noisy, multi-speaker environments. Our observations demonstrate the feasibility of privacy-preserving, on-device conversational robotics while highlighting limitations in turn-taking, response length, and environmental grounding.

PAIR: A Pilot Dataset for Dual Perspective-based Video-Grounded Dialogue and Reconciliation

Publication: LREC 2026 In Proceedings or Self-Hosted PDF
Date: 11–16th May 2026
Conference: Proceedings of the 2026 International Conference on Language Resources and Evaluation LREC 2026 - Palma, Mallorca
Authors: Lewis Watson, Carl Strathearn, Kenny Mitchell, and Yanchao Yu
Institution: School of Computing, Engineering, and the Built Environment, Edinburgh Napier University

Abstract

An ongoing challenge in task-based multi-agent systems is enabling collaborators to solve problems when each has only a partial view of the environment. Achieving a shared understanding requires more than simple information exchange; it involves reconciling perspectives, negotiating interpretations, and engaging in joint problem-solving.

We introduce PAIR, a pilot conversational corpus designed to capture how humans integrate complementary observations when interpreting dynamic video scenes. PAIR comprises 15 dialogues in which participants viewed the same event from egocentric and exocentric perspectives, then engaged in face-to-face discussions to construct a shared account. Transcripts were manually verified and annotated with 42 dialogue act categories, revealing interactional strategies such as questioning, clarification, and agreement.

While lightweight, PAIR foregrounds collaborative sense-making in task-oriented dialogue, providing a controlled testbed for dialogue act classification, video-grounded dialogue modelling, and multi-agent reasoning. The dataset is released openly to support the development and benchmarking of systems that must reconcile complementary inputs to achieve common goals.

2024

Exploring the Impact of Data Representation on Neural Data-to-Text Generation

Publication: ACL Anthology or PDF (Self-hosted)
Date: 23–27th September 2024
Conference: The 17th International Natural Language Generation Conference - INLG 2024
Authors: David M. Howcroft, Lewis N. Watson, Olesia Nedopas, and Dimitra Gkatzia
Institution: School of Computing, Engineering, and the Built Environment, Edinburgh Napier University

Abstract

A relatively under-explored area in research on neural natural language generation is the impact of the data representation on text quality. Here we report experiments on two leading input representations for data-to-text generation: attribute-value pairs and Resource Description Framework (RDF) triples. Evaluating the performance of encoder-decoder seq2seq models as well as recent large language models (LLMs) with both automated metrics and human evaluation, we find that the input representation does not seem to have a large impact on the performance of either purpose-built seq2seq models or LLMs. Finally, we present an error analysis of the texts generated by the LLMs and provide some insights into where these models fail.

ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation

Publication: ACL Anthology or Self-hosted PDF
Date: 21st May 2024
Conference: The 4th Workshop on Human Evaluation of NLP Systems (HumEval’24) at LREC-COLING 2024
Authors: Lewis Watson and Dimitra Gkatzia
Institution: School of Computing, Engineering, and the Built Environment, Edinburgh Napier University

Abstract

Reproducibility is a cornerstone of scientific research, ensuring the reliability and generalisability of findings. The ReproNLP Shared Task on Reproducibility of Evaluations in NLP aims to assess the reproducibility of human evaluation studies. This paper presents a reproduction study of the human evaluation experiment in “Hierarchical Sketch Induction for Paraphrase Generation” by Hosking et al. (2022). The original study employed a human evaluation on Amazon Mechanical Turk, assessing the quality of paraphrases generated by their proposed model using three criteria: meaning preservation, fluency, and dissimilarity. In our reproduction study, we focus on the meaning preservation criterion and utilise the Prolific platform for participant recruitment, following the ReproNLP challenge’s common approach to reproduction. We discuss the methodology, results, and implications of our reproduction study, comparing them to the original findings. Our findings contribute to the understanding of reproducibility in NLP research and highlight the potential impact of platform changes and evaluation criteria on the reproducibility of human evaluation studies.

2023

Unveiling NLG Human-Evaluation Reproducibility: Lessons Learned and Key Insights from Participating in the ReproNLP Challenge

Publication: ACL Anthology
Date: 7th September 2023
Conference: The 3rd Workshop on Human Evaluation of NLP Systems (HumEval’23) at RANLP 2023
Authors: Lewis Watson and Dimitra Gkatzia
Institution: School of Computing, Engineering, and the Built Environment, Edinburgh Napier University

Abstract

Human evaluation is crucial for NLG systems as it provides a reliable assessment of the quality, effectiveness, and utility of generated language outputs. However, concerns about the reproducibility of such evaluations have emerged, casting doubt on the reliability and generalisability of reported results. In this paper, we present the findings of a reproducibility study on a data-to-text system, conducted under two conditions: (1) replicating the original setup as closely as possible with evaluators from AMT, and (2) replicating the original human evaluation, but this time utilizing evaluators with a background in academia. Our experiments show that there is a loss of statistical significance between the original and reproduction studies, i.e. the human evaluation results are not reproducible. In addition, we found that employing local participants led to more robust results. We finally discuss lessons learned, addressing the challenges and best practices for ensuring reproducibility in NLG human evaluations.

Backup Links: Backup 1 (Workshop) or Backup 2 (Self)

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Publication: ArXiv or PDF (Self-hosted)
Date: 7th August 2023

Abstract

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

Honours Dissertation: Investigating Reproducibility of Human Evaluations In Natural Language Generation

Note: This dissertation is uploaded as submitted. Since submission, the paper and some results have been corrected; please see the latest versions of the papers above.

Link: PDF
Author: Lewis Watson
Degree: BSc Computer Science (Hons) @ Edinburgh Napier University
Date: 19th April 2023

Abstract

Reproducibility is a cornerstone of scientific research, and its relevance in evaluating Natural Language Generation (NLG) systems has been increasingly recognised. In NLG, human evaluations in which crowd-sourced annotators score the quality of generated text are frequently applied. However, due to a number of factors, including claimed study design and details, choice of metrics, and the variance of human annotators, the reproducibility of these evaluations is frequently questioned. The purpose of this study is to examine the reproducibility of human evaluations in NLG systems and to pinpoint the causes of irreproducibility. To do this, we evaluate the relevant literature and reproduce a human evaluation experiment from a previous study, comparing the outcomes to the reported findings in the original study. The Human Evaluations Data Sheet (HEDS) is utilised to enhance transparency and facilitate reproducibility. The project aims to provide valuable insights into making human evaluations of NLG systems more reproducible and to propose potential solutions to mitigate irreproducibility in future studies.

Stay tuned for more publications soon. In the meantime, check out my blog posts here.