PERFORMANCE OF CHATGPT-5.1 AND GEMINI 2.5 FLASH ON THE UKRAINIAN LICENSING INTEGRATED EXAMINATION “KROK 3”: A COMPARATIVE STUDY
DOI: https://doi.org/10.30888/2663-5712.2026-35-02-006

Keywords
large language models, artificial intelligence, medical licensing examination, Krok 3, multiple-choice questions, role-based prompting, response stability

Abstract
The development of large language models (LLMs) has opened new perspectives for their integration into healthcare. Previous studies have demonstrated the high effectiveness of leading LLMs, such as GPT-4, in passing standardized English-language medical

References
Vrdoljak, J., Boban, Z., Vilović, M., Kumrić, M., & Božić, J. (2025). A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare, 13(6), 603. https://doi.org/10.3390/healthcare13060603.
Advances in Large Language Models for Medicine (2025). arXiv. https://arxiv.org/html/2509.18690v1.
Lin, C., & Kuo, C.-F. (2025). Roles and Potential of Large Language Models in Healthcare: A Comprehensive Review. Biomedical Journal, 100868. https://doi.org/10.1016/j.bj.2025.100868.
Zong, H., Wu, R., Cha, J., Wang, J., Wu, E., Li, J., Zhou, Y., Zhang, C., Feng, W., & Shen, B. (2024). Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. Journal of medical Internet research, 26, e66114. https://doi.org/10.2196/66114.
Nouri, H., Mahdavi, A., Abedi, A., Mohammadnia, A., Hamedan, M., & Amanzadeh, M. (2025). Performance of large language models in medical licensing examinations: a systematic review and meta-analysis. Journal of educational evaluation for health professions, 22, 36. https://doi.org/10.3352/jeehp.2025.22.36.
Yang, Z., Yao, Z., Tasmin, M., Vashisht, P., Jang, W. S., Ouyang, F., Wang, B., McManus, D., Berlowitz, D., & Yu, H. (2025). Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study. Journal of medical Internet research, 27, e65146. https://doi.org/10.2196/65146.
Nori, H., King, N., McKinney, S.M., Carignan, D., & Horvitz, E. (2023). Capabilities of GPT-4 on Medical Challenge Problems. ArXiv, abs/2303.13375.
Workum, J. D., Volkers, B. W. S., van de Sande, D., Arora, S., Goeijenbier, M., Gommers, D., & van Genderen, M. E. (2025). Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study. Critical care (London, England), 29(1), 72. https://doi.org/10.1186/s13054-025-05302-0.
Kasagga, A., Sapkota, A., Changaramkumarath, G., Abucha, J. M., Wollel, M. M., Somannagari, N., Husami, M. Y., Hailu, K. T., & Kasagga, E. (2025). Performance of ChatGPT and Large Language Models on Medical Licensing Exams Worldwide: A Systematic Review and Network Meta-Analysis With Meta-Regression. Cureus, 17(10), e94300. https://doi.org/10.7759/cureus.94300.
Wu, J., Wang, Z., & Qin, Y. (2025). Performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination: A Comparative Study. Journal of medical systems, 49(1), 74. https://doi.org/10.1007/s10916-025-02213-z.
Guillen-Grima, F., Guillen-Aguinaga, S., Guillen-Aguinaga, L., Alas-Brun, R., Onambele, L., Ortega, W., Montejo, R., Aguinaga-Ontoso, E., Barach, P., & Aguinaga-Ontoso, I. (2023). Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine. Clinics and practice, 13(6), 1460–1487. https://doi.org/10.3390/clinpract13060130.
Fujimoto, M., Kuroda, H., Katayama, T., Yamaguchi, A., Katagiri, N., Kagawa, K., Tsukimoto, S., Nakano, A., Imaizumi, U., Sato-Boku, A., Kishimoto, N., Itamiya, T., Kido, K., & Sanuki, T. (2024). Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam. Cureus, 16(9), e70302. https://doi.org/10.7759/cureus.70302.
Liu, M., Okuhara, T., Dai, Z., Huang, W., Gu, L., Okada, H., Furukawa, E., & Kiuchi, T. (2025). Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination. International journal of medical informatics, 193, 105673. https://doi.org/10.1016/j.ijmedinf.2024.105673.
Gwet, K. L. (2014). Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Gaithersburg, MD: Advanced Analytics, LLC, pp. 62–65.
Yang, Y., Jin, Q., Zhu, Q., Wang, Z., Erramuspe Álvarez, F., Wan, N., Hou, B., & Lu, Z. (2025). Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare. Annual Review of Biomedical Data Science. https://doi.org/10.1146/annurev-biodatasci-103123-094851.
Schlicht, I.B., Zhao, Z., Sayin, B., Flek, L., Rosso, P. (2025). Do LLMs Provide Consistent Answers to Health-Related Questions Across Languages?. In: Hauff, C., et al. Advances in Information Retrieval. ECIR 2025. Lecture Notes in Computer Science, vol 15574. Springer, Cham. https://doi.org/10.1007/978-3-031-88714-7_30.
Vasylovska, I. (2017). The modern state and problems of ukrainian medical terminography. Theory and Practice of Teaching Ukrainian as a Foreign Language, (13), 116–121.
Bicknell, B. T., Butler, D., Whalen, S., Ricks, J., Dixon, C. J., Clark, A. B., Spaedy, O., Skelton, A., Edupuganti, N., Dzubinski, L., Tate, H., Dyess, G., Lindeman, B., & Lehmann, L. S. (2024). Critical Analysis of ChatGPT 4 Omni in USMLE Disciplines, Clinical Clerkships, and Clinical Skills (Preprint). JMIR Medical Education. https://doi.org/10.2196/63430.
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.