Nakano, R., Hesse, C., & Schulman, J. (2021). Training
Verifiers to Solve Math Word Problems.
http://arxiv.org/pdf/2110.14168.pdf
Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A.,
Tanzer, G., Vincent, D., Pan, Z., Wang, S., Mariooryad,
S., Ding, Y., Geng, X., Alcober, F., Frostig, R.,
Omernick, M., Walker, L., Paduraru, C., Sorokin, C.,
Tacchetti, A., … Vinyals, O. (2024). Gemini 1.5:
Unlocking multimodal understanding across millions of
tokens of context. http://arxiv.org/pdf/2403.05530
Guo, Z., Jin, R., Liu, C [Chuang], Huang, Y., Shi, D.,
Supryadi, Yu, L [Linhao], Liu, Y [Yan], Li, J [Jiaxuan],
Xiong, B., & Xiong, D. (2023). Evaluating Large
Language Models: A Comprehensive Survey.
http://arxiv.org/pdf/2310.19736
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic
Data for Text Localisation in Natural Images.
http://arxiv.org/pdf/1604.06646
Guyon, I., Haralick, R. M., Hull, J. J., & Phillips, I. T.
(2000). Data sets for OCR and Document Image
Understanding Research. In H. Bunke (Ed.), Handbook
of character recognition and document image analysis
(1. publ., repr, pp. 779–799). World Scientific.
https://doi.org/10.1142/9789812830968_0030
Hartley, R. T., & Crumpton, K. (1999). Quality of OCR for
Degraded Text Images. http://arxiv.org/pdf/cs/9902009
Hegghammer, T. (2022). OCR with Tesseract, Amazon
Textract, and Google Document AI: A benchmarking
experiment. Journal of Computational Social Science,
5(1), 861–882. https://doi.org/10.1007/s42001-021-00149-1
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart,
S., Tang, E., Song, D., & Steinhardt, J. (2021).
Measuring Mathematical Problem Solving With the
MATH Dataset. http://arxiv.org/pdf/2103.03874.pdf
Huang, L., Yu, W [Weijiang], Ma, W., Zhong, W., Feng, Z.,
Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu,
T [Ting] (2024). A Survey on Hallucination in Large
Language Models: Principles, Taxonomy, Challenges,
and Open Questions. ACM Transactions on
Information Systems, Article 3703155. Advance online
publication. https://doi.org/10.1145/3703155
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A.
(2014). Synthetic Data and Artificial Neural Networks
for Natural Scene Text Recognition.
http://arxiv.org/pdf/1406.2227
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A.
(2016). Reading Text in the Wild with Convolutional
Neural Networks. International Journal of Computer
Vision, 116(1), 1–20.
https://doi.org/10.1007/s11263-015-0823-z
Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep
Features for Text Spotting. In Computer Vision – ECCV
2014 (pp. 512–528). Springer, Cham.
https://doi.org/10.1007/978-3-319-10593-2_34
Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda,
L. G. i., Mestre, S. R., Mas, J., Mota, D. F., Almazan, J.
A., & las Heras, L. P. de (2013). ICDAR 2013 Robust
Reading Competition. In 2013 12th International
Conference on Document Analysis and Recognition (pp.
1484–1493). IEEE. https://doi.org/10.1109/ICDAR.2013.221
Kocmi, T., Federmann, C., Grundkiewicz, R., Junczys-
Dowmunt, M., Matsushita, H., & Menezes, A. (2021).
To Ship or Not to Ship: An Extensive Evaluation of
Automatic Metrics for Machine Translation.
Proceedings of the Sixth Conference on Machine
Translation, 478–494.
https://aclanthology.org/2021.wmt-1.57/
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y.
(2022). Large Language Models are Zero-Shot
Reasoners. http://arxiv.org/pdf/2205.11916
Lee, S [Seungjun], Lee, J [Jungseob], Moon, H., Park, C.,
Seo, J., Eo, S., Koo, S., & Lim, H [Heuiseok] (2023). A
Survey on Evaluation Metrics for Machine Translation.
Mathematics, 11(4), 1006.
https://doi.org/10.3390/math11041006
Lin, C.‑Y. (2004). ROUGE: A Package for Automatic
Evaluation of Summaries. Text Summarization
Branches Out, 74–81.
https://aclanthology.org/W04-1013/
Liu, Y [Yuliang], Li, Z [Zhang], Huang, M., Yang, B., Yu,
W [Wenwen], Li, C [Chunyuan], Yin, X., Liu, C
[Cheng-lin], Jin, L., & Bai, X. (2023). OCRBench: On
the Hidden Mystery of OCR in Large Multimodal
Models. http://arxiv.org/pdf/2305.07895
Lu, P., Bansal, H., Xia, T., Liu, J [Jiacheng], Li, C
[Chunyuan], Hajishirzi, H., Cheng, H., Chang, K.‑W.,
Galley, M., & Gao, J. (2023). MathVista: Evaluating
Mathematical Reasoning of Foundation Models in
Visual Contexts. http://arxiv.org/pdf/2310.02255
Masry, A., Long, D., Tan, J. Q., Joty, S., & Hoque, E.
(2022). ChartQA: A Benchmark for Question
Answering about Charts with Visual and Logical
Reasoning. Findings of the Association for
Computational Linguistics: ACL 2022, 2263–2279.
https://doi.org/10.18653/v1/2022.findings-acl.177
McCoy, R. T., Smolensky, P., Linzen, T., Gao, J., &
Celikyilmaz, A. (2023). How Much Do Language
Models Copy From Their Training Data? Evaluating
Linguistic Novelty in Text Generation Using RAVEN.
Transactions of the Association for Computational
Linguistics, 11, 652–670.
https://doi.org/10.1162/tacl_a_00567
Meta. (2024). Llama 3.2. https://www.llama.com/
Mishra, A., Shekhar, S., Singh, A. K., & Chakraborty, A
[Anirban] (2019). OCR-VQA: Visual Question
Answering by Reading Text in Images. In 2019
International Conference on Document Analysis and
Recognition (ICDAR) (pp. 947–952). IEEE.
https://doi.org/10.1109/ICDAR.2019.00156
Neudecker, C., Baierer, K., Gerber, M., Clausner, C.,
Antonacopoulos, A., & Pletschacher, S. (2021). A
survey of OCR evaluation tools and metrics. In The 6th
International Workshop on Historical Document
Imaging and Processing (pp. 13–18). ACM.
https://doi.org/10.1145/3476887.3476888
OpenAI. (2023). GPT-4 Technical Report.
http://arxiv.org/pdf/2303.08774.pdf