A recent collaborative study by the University of Maryland (UMD) and Microsoft investigated how well major AI language models respond to prompts in different human languages. The researchers tested a set of leading models, including those from OpenAI, Google Gemini, Qwen, Llama, and DeepSeek, by presenting identical tasks in 26 languages and measuring their accuracy. The surprising finding was that Polish emerged as the top-performing language for prompting AI, while English ranked only sixth.
According to the results, on longer text tasks Polish prompts achieved an average accuracy of about 88%, while English prompts averaged around 83.9%. The researchers also found that the volume of training data available in a given language did not strictly predict prompting performance: even though far less training data exists for Polish than for languages like English or Chinese, Polish still outperformed them in this prompt-effectiveness test.
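To make the comparison concrete, the sketch below shows one way a per-language accuracy measurement like this could be set up. It is purely illustrative and not the study's actual harness: the `query_model` stub, the sample prompts, and the simple containment-based scoring rule are all assumptions standing in for a real model API and evaluation protocol.

```python
from typing import Callable

def query_model(prompt: str) -> str:
    """Placeholder for a real model call (e.g. an LLM API client).
    Here it just returns a dummy answer so the sketch runs standalone."""
    return "4"

def language_accuracy(
    tasks: list[dict],          # each task: {"prompt": str, "expected": str}
    ask: Callable[[str], str],  # function that sends a prompt and returns the model's reply
) -> float:
    """Fraction of tasks where the model's reply contains the expected answer."""
    correct = 0
    for task in tasks:
        reply = ask(task["prompt"])
        if task["expected"].lower() in reply.lower():
            correct += 1
    return correct / len(tasks)

# Hypothetical data: the same task phrased in two of the tested languages.
tasks_by_language = {
    "Polish":  [{"prompt": "Ile to jest 2 + 2?", "expected": "4"}],
    "English": [{"prompt": "What is 2 + 2?",     "expected": "4"}],
}

for language, tasks in tasks_by_language.items():
    score = language_accuracy(tasks, query_model)
    print(f"{language}: {score:.1%} accuracy")
```

In practice, a benchmark of this kind would use many tasks per language, professionally translated prompts, and a scoring method suited to longer text outputs, but the overall loop of prompting in each language and comparing accuracy is the same idea.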
Interestingly, the study also reported that Chinese underperformed, ranking fourth from the bottom among the 26 languages tested. This finding challenges the assumption that more widely used, high-resource languages necessarily yield better AI performance in prompt interpretation, and it encourages further investigation into what features of a language make it "effective" for AI prompting.
The implications of this study are significant for AI-powered workflows and multilingual systems. The results suggest that users might obtain better outcomes by deliberately choosing the language in which they craft prompts rather than defaulting to English. For organisations developing multilingual AI tools, the study raises questions about how prompting effectiveness varies across languages and which linguistic or structural factors contribute to a language's performance in AI contexts.