Kitchenham, B. A., Al-Kilidar, H., Babar, M. A., Berry,
M., et al. (2008). Evaluating guidelines for reporting
empirical software engineering studies. Empir. Softw.
Eng., 13(1):97–121.
Kottner, J., Audigé, L., Brorson, S., Donner, A., et al.
(2010). Guidelines for reporting reliability and
agreement studies (GRRAS) were proposed. J Clin
Epidemiol, 64(1):96–106.
Krippendorff, K. (2004). Content Analysis: An
Introduction to Its Methodology (second edition). Sage
Publications.
Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., et al.
(2023). Performance of ChatGPT on USMLE: Potential
for AI-assisted medical education using large language
models. PLOS Digital Health, 2(2):1–12.
Landis, J. R. and Koch, G. G. (1977). The measurement of
observer agreement for categorical data. Biometrics,
33(1):159–174.
Le, T. H. M., Chen, H., and Babar, M. A. (2020). Deep
learning for source code modeling and generation:
Models, applications, and challenges. ACM Comput.
Surv., 53(3).
Li, Y., Choi, D., Chung, J., Kushman, N., et al. (2022).
Competition-level code generation with AlphaCode.
Science, 378(6624):1092–1097.
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., et al.
(2018). Generating Wikipedia by summarizing long
sequences. arXiv preprint arXiv:1801.10198.
Liu, Y., Ott, M., Goyal, N., Du, J., et al. (2019). RoBERTa:
A robustly optimized BERT pretraining approach. arXiv
preprint arXiv:1907.11692.
Madeyski, L. and Kitchenham, B. A. (2018). Effect sizes
and their variance for AB/BA crossover design
studies. Empir. Softw. Eng., 23(4):1982–2017.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Burges, C.,
Bottou, L., Welling, M., Ghahramani, Z., and
Weinberger, K., editors, Advances in Neural Information
Processing Systems, volume 26. Curran Associates,
Inc.
Mirza, R., Punja, S., Vohra, S., and Guyatt, G. (2017).
The history and development of n-of-1 trials.
Journal of the Royal Society of Medicine, 110(8):330–340.
PMID: 28776473.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., et al. (2022).
Training language models to follow instructions with
human feedback. In Koyejo, S., Mohamed, S.,
Agarwal, A., Belgrave, D., et al., editors, Advances in
Neural Information Processing Systems, volume 35, pages
27730–27744. Curran Associates, Inc.
Perdices, M., Schultz, R., Tate, R., McDonald, S., et al.
(2006). The evidence base of neuropsychological
rehabilitation in acquired brain impairment (ABI): How
good is the research? Brain Impairment,
7:119–132.
Radford, A., Wu, J., Child, R., Luan, D., et al. (2019).
Language models are unsupervised multitask learners.
OpenAI blog, 1(8):9.
Senn, S. (2002). Cross-over Trials in Clinical Research.
Statistics in Practice. Wiley.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., et al.
(2017). Attention is all you need. In Guyon, I.,
Luxburg, U. V., Bengio, S., Wallach, H., et al., editors,
Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Vegas, S., Apa, C., and Juristo, N. (2016). Crossover
designs in software engineering experiments: Benefits
and perils. IEEE Transactions on Software
Engineering, 42(2):120–135.
Wang, X., Liu, X., Zhou, P., Liu, Q., et al. (2023).
Test-driven multi-task learning with functionally
equivalent code transformation for neural code generation.
In Proceedings of the 37th IEEE/ACM International
Conference on Automated Software Engineering, ASE
’22, New York, NY, USA. Association for Computing
Machinery.
Wiseman, R. (1676). Eight Chirurgical Treatises. B. Tooke,
J. Knapton, T. Horne, fourth edition.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M., et al.
(2012). Experimentation in Software Engineering.
Computer Science. Springer Berlin Heidelberg.
ICSOFT 2023 - 18th International Conference on Software Technologies