Is generative AI a winning tool for research?
- Scientists are currently testing methods of integrating large language models (LLMs) into research practices, which raises a number of questions.
- LLMs are effective in detecting the tone of an article or comment, but less so in detecting rhetorical forms.
- LLMs are most commonly used for text classification in social sciences, changing the way research is conducted.
- There are risks associated with LLMs, such as the inability to replicate work, lack of data security, and the use of poor-quality data.
- It is crucial to reflect on AI’s contributions to research through the lens of the scientific method.
You have co-authored an article on the dangers of artificial intelligence (AI) in research. Why did you decide to carry out this work?
Arnault Chatelain. Today, scientists are experimenting with large language models (LLMs), which are an important part of AI. Everyone is testing different methods to integrate them into research practices, but many questions remain. For certain applications, these LLMs are very effective. For example, they are good at detecting the tone of an article or comment. However, they become much less effective for more complicated tasks, such as detecting rhetorical forms.
How are scientists using AI in their work?
I will only comment on the field I am familiar with, namely the social sciences, and more specifically economics, sociology and political science. Scientists mainly use LLMs to assist them and process large amounts of text. The first application is fairly generic: reformatting texts, reorganising data tables, writing computer code, etc. The use of ChatGPT-type chatbots saves time, as many users outside scientific research have discovered.
The most common use of LLMs in the social sciences is text classification. Previously, studying large amounts of text had to be done manually, a very time-consuming process. Today, it is possible to manually annotate a sample of texts and then extend those annotations to an entire corpus using language models. In our computational social science research team, we are trying to detect the use of rare rhetorical forms in the press. We annotate around a hundred articles, and we can then extend our annotations to the entire press corpus. This gives us an overview that would have been impossible to produce without AI. In this sense, this tool increases our possibilities and changes the way we do research.
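To make this workflow concrete, here is a minimal sketch in Python, assuming an open-weight model run locally through the Hugging Face transformers library. The model, labels and example texts are illustrative placeholders rather than the team’s actual setup, and a real study would rely on a much larger hand-annotated sample.

```python
# Minimal sketch: check a language model against a hand-annotated sample
# before extending its labels to a full corpus. The model, labels and texts
# are illustrative placeholders, not the setup described in the interview.
from transformers import pipeline

# Hand-annotated sample: (text, label) pairs produced by human coders.
annotated_sample = [
    ("The minister's plan is a triumph of common sense.", "ironic"),
    ("The new budget was presented to parliament on Tuesday.", "literal"),
    # ...around a hundred examples in practice
]

# Open-weight model downloaded and run locally.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["ironic", "literal"]

def predict(text: str) -> str:
    """Return the label the model considers most likely for one text."""
    return classifier(text, candidate_labels=labels)["labels"][0]

# 1) Measure agreement with the human annotations on the sample.
agreement = sum(predict(text) == gold for text, gold in annotated_sample)
print(f"Agreement with human coders: {agreement}/{len(annotated_sample)}")

# 2) Only if agreement is acceptable, extend the labels to the whole corpus.
corpus = ["...thousands of press articles..."]
corpus_labels = [predict(text) for text in corpus]
```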
What dangers do you see in using AI for scientific research?
First of all, there is a risk concerning replicability. The replicability of results is essential to the scientific method. However, proprietary models [editor’s note: owned by private companies] evolve and can disappear overnight, as is the case with older versions of ChatGPT such as GPT-3.5. This makes it impossible to replicate the work. Another danger concerns data security. For scientists working with sensitive data, such as health data, it is important not to share that data with private companies. However, the temptation can be strong in the absence of easily accessible non-proprietary alternatives. To avoid any risk, it would therefore be preferable to use freely accessible models downloaded locally, but this requires adequate infrastructure. Finally, I have observed that models rely on large amounts of data, which can sometimes be of poor quality. We still have a limited understanding of the type of bias this can produce within models.
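One hedged way to address both concerns at once is to pin an exact version of an open-weight model, keep a local copy, and run it offline. The sketch below assumes the Hugging Face huggingface_hub and transformers libraries; the model name and revision are placeholders, not a recommendation.

```python
# Minimal sketch: run a pinned, open-weight model entirely locally, so that
# sensitive texts never leave the machine and the exact same weights can be
# reloaded later. Model name and revision are illustrative placeholders.
from huggingface_hub import snapshot_download
from transformers import pipeline

MODEL = "facebook/bart-large-mnli"   # open-weight model (placeholder)
REVISION = "main"                    # in practice, pin a specific commit hash

# One-off download; the folder can then be archived alongside the study.
local_path = snapshot_download(repo_id=MODEL, revision=REVISION)

# Load from the local copy; after the download, no text is sent to any
# external service. (Setting the HF_HUB_OFFLINE=1 environment variable
# additionally forbids any further network access.)
classifier = pipeline("zero-shot-classification", model=local_path)

sensitive_text = "Patient reports fewer symptoms after the new treatment."
print(classifier(sensitive_text, candidate_labels=["improvement", "no improvement"]))
```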
What are the causes of these limitations?
With proprietary models, the problem is precisely that we do not have control over the model we are using. Another issue stems from the fact that we do not fully understand how LLMs work, whether they are proprietary or open source. Even when we have access to the code, we are unable to explain the results obtained by AI. It has been demonstrated that by repeating the same tasks on the same model for several months, the results vary greatly and cannot be reproduced [1].
Following a series of articles claiming that generative AI could respond to surveys in place of humans, my colleagues have recently highlighted significant and unpredictable variability in simulations of responses to an opinion questionnaire [2]. They refer to this problem as “machine bias”.
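The scale of this run-to-run variability can be checked with a very simple protocol: ask the same model the same question many times and look at how much the answers spread out. The sketch below illustrates the idea only; ask_model is a hypothetical stand-in whose random draws simulate variability, they do not reproduce the behaviour of any particular model or the cited studies.

```python
# Minimal sketch: quantify how much a model's answers to one survey question
# spread out across repeated, identical queries. `ask_model` is a hypothetical
# stand-in; its random draws simulate variability, not any real model.
from collections import Counter
import random

def ask_model(question: str) -> str:
    """Hypothetical model call (an API request or a local generation)."""
    return random.choice(["agree", "agree", "disagree", "no opinion"])

question = "Do you agree with the following statement: ...?"
answers = Counter(ask_model(question) for _ in range(100))

# A perfectly stable model would return one answer 100 times out of 100;
# a wide spread is the kind of 'machine bias' discussed above.
print(answers)
```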
And regarding the danger of proprietary AI, isn’t it possible to get around the problem by working with open-source AI?
Of course, it is possible to replicate an experiment using open-source models, although this does not solve the problem of explainability mentioned above. We could, for example, consider using open-access models by default and only using proprietary models when absolutely necessary, as some have suggested [3]. An article published in 2024 highlights the value of creating an open-access infrastructure for sociological research to address this issue [4]. However, this raises questions about the proliferation of models, the storage space required and the environmental cost. It also requires suitable and easily accessible infrastructure.
Are there other safeguards for the proper use of AI in research?
There is a real need for better training for scientists: how AI models work, their limitations, how to use them properly, etc. I think scientists need to be made aware of the dangers of AI, without demonising it, as it can be useful for their work.
Didn’t scientists ask themselves these questions when language models first appeared?
Questions about the dangers of LLMs for research, or the best practices to implement, are fairly recent. The first wave of work was marked by enthusiasm from the social science community. That’s what prompted us to publish our article.
Today, there is growing interest in evaluating language models, but it is a complex issue. Until now, it has mainly been the computer science community that has taken on the task of testing the performance of models, particularly because it requires a certain amount of technical expertise. This year, I worked in a team of computer scientists, linguists and sociologists to better incorporate the needs of social sciences into AI evaluation criteria [5]. This involves paying closer attention to the nature of the test data used. Does good performance on tweets guarantee similar performance on news articles or speeches?
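That question can be made concrete with a small comparison: score the same classifier on gold-labelled test sets drawn from different text types. The sketch below assumes scikit-learn; predict_label and both tiny test sets are hypothetical placeholders, not an actual evaluation.

```python
# Minimal sketch: the same classifier can score well on one text type and
# worse on another, so test data should match the material actually studied.
# `predict_label` and both tiny test sets are hypothetical placeholders.
from sklearn.metrics import accuracy_score

def predict_label(text: str) -> str:
    """Hypothetical classifier (an LLM prompt, a fine-tuned model, etc.)."""
    return "ironic" if "triumph" in text.lower() else "literal"

tweet_test = [("What a triumph of urban planning, another flooded metro...", "ironic"),
              ("Line 4 is closed between Odeon and Chatelet.", "literal")]
news_test = [("The reform is, of course, a resounding triumph.", "ironic"),
             ("Analysts called the merger a triumph for shareholders.", "literal")]

for name, test_set in [("tweets", tweet_test), ("news", news_test)]:
    gold = [label for _, label in test_set]
    pred = [predict_label(text) for text, _ in test_set]
    print(name, accuracy_score(gold, pred))   # good on tweets, weaker on news
```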
As for the replicability of studies, the social sciences were already facing a replication crisis; AI is now amplifying the discussion around it.
Should we stop or continue to use AI in research?
I think it is essential to reflect on the contributions of AI. Is it of real benefit to research? This requires reliable, scientifically based measurement of the robustness of language models. Another prerequisite is the establishment of a rigorous framework for the use of AI in research. Finally, we need to ask ourselves how dependent the scientific community is on private actors. This carries many risks, particularly for research strategy: if scientists focus on work where AI can help them, this will influence the direction of their research.