THE UNSEEN DATA: A STATISTICAL AND ENGINEERING PERSPECTIVE ON BIASES IN LARGE LANGUAGE MODELS

DOI:

https://doi.org/10.30888/2663-5712.2025-33-01-078

Keywords:

LLM, AI Bias, Data Imbalance, Fairness, Machine Learning, Computational Linguistics, Statistical Bias, Ethical AI

Abstract

The paper argues that bias in large language models (LLMs) is a fundamentally statistical problem rooted in the nature of their training data. The unfiltered datasets used for training are not representative samples of human language, but rather deeply im

References

Alqahtani, T., Badreldin, H. A., Alrashed, M., Alshaya, A. I., Alghamdi, S. S., bin Saleh, K., Alowais, S. A., Alshaya, O. A., Rahman, I., Al Yami, M. S. and Albekairy, A. M. (2023) 'The emergent role of artificial intelligence, natural learning processing, and large language models in higher education and research', Research in Social and Administrative Pharmacy, 19(8), pp. 1236–1242. doi: https://doi.org/10.1016/j.sapharm.2023.05.016.

Chiarello, F., Giordano, V., Spada, I., Barandoni, S. and Fantoni, G. (2024) 'Future applications of generative large language models: A data-driven case study on ChatGPT', Technovation, 133, p. 103002. doi: https://doi.org/10.1016/j.technovation.2024.103002.

De-Arteaga, M. et al. (2019) 'Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting', in Proceedings of the Conference on Fairness, Accountability, and Transparency, ACM. doi: https://doi.org/10.1145/3287560.3287572.

Gallegos, I. O. et al. (2024) 'Bias and Fairness in Large Language Models: A Survey', Computational Linguistics, 50(3), pp. 1097–1179. doi: https://doi.org/10.1162/coli_a_00524.

Guo, Y. et al. (2024) 'Bias in Large Language Models: Origin, Evaluation, and Mitigation', arXiv. doi: https://doi.org/10.48550/arXiv.2411.10915.

Makwana, D., Engineer, P., Dabhi, A. and Chudasama, H. (2023) 'Sampling Methods in Research: A Review', 7, pp. 762–768.

Mesko, B. (2023) 'The ChatGPT (Generative Artificial Intelligence) Revolution Has Made Artificial Intelligence Approachable for Medical Professionals', Journal of Medical Internet Research, 25, p. e48392. doi: https://doi.org/10.2196/48392.

Noguer I Alonso, M. (2024) 'Large Language Models in Finance: Reasoning', SSRN. doi: http://dx.doi.org/10.2139/ssrn.5048316.

Sakaguchi, K., Le Bras, R., Bhagavatula, C. and Choi, Y. (2020) 'WinoGrande: An Adversarial Winograd Schema Challenge at Scale', in Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), pp. 8732–8740. doi: https://doi.org/10.1609/aaai.v34i05.6399.

Tumanov, O. O. (2019) 'Aspects of Using Social Media in Research', Scientific Bulletin of the National Academy of Statistics, Accounting and Audit, (4), pp. 24–29. doi: https://doi.org/10.31767/nasoa.4.2019.03.

Tumanov, O. O. (2019) 'Social media as an object of statistical research', Business Inform, (12), pp. 8–14. doi: https://doi.org/10.32983/2222-4459-2019-12-8-14.

Tumanov, O. O. (2020) 'Statistical methods for analyzing social media data', Business Inform, 2, pp. 266–272. doi: https://doi.org/10.32983/2222-4459-2020-2-266-272.

Published

2025-09-30

How to Cite

Tumanov, O. (2025). THE UNSEEN DATA: A STATISTICAL AND ENGINEERING PERSPECTIVE ON BIASES IN LARGE LANGUAGE MODELS. SWorldJournal, 1(33-01), 179–187. https://doi.org/10.30888/2663-5712.2025-33-01-078

Section

Articles