Generative artificial intelligence to help researchers synthesize scientific data

In research, it is often necessary to synthesize all the existing studies on a given subject – this is the principle behind systematic reviews. Today, these systematic reviews are mainly carried out by hand.

Prof. Viet-Thi Tran of the METHODS team at the Centre de Recherche Epidémiologie et StatistiqueS (CRESS, Université Paris Cité / INSERM) developed a set of instructions for the GPT-3.5 model (OpenAI, the model behind ChatGPT) to automate the sorting of studies to be included in systematic reviews. The performance of this approach was close to that of human researchers.

The study was published in the journal Annals of Internal Medicine on May 21, 2024.

Every day, a doctor would need to read seventy-five clinical studies to keep their scientific knowledge of disease management up to date (1). The exponential growth of scientific knowledge (the number of scientific articles published is estimated to double every 15 years (2)) makes it necessary to produce systematic reviews, i.e. rigorous, reproducible syntheses of all existing studies answering the same research question.

To date, systematic reviews are still mainly carried out manually by researchers, who sort, read and evaluate all the scientific literature on a given subject. Carrying out a systematic review represents several thousand hours of work for experienced researchers (3). One step in particular is extremely tedious: selecting, from among the thousands of published studies, those that specifically answer the research question and are to be included in the systematic review. What’s more, given the risk of error, this sorting is generally carried out by two researchers in duplicate and independently, to ensure that no important study is missed.

In the present study, a research team led by Prof. Viet-Thi Tran of the Centre de Recherche Epidémiologie et StatistiqueS (CRESS, Université Paris Cité / INSERM) evaluated the performance of GPT-3.5 (OpenAI, the model behind ChatGPT) in screening studies for inclusion in or exclusion from systematic reviews, using data from five reviews by the Cochrane Centers in France, Austria and Germany, covering over 22,000 studies. The researchers evaluated two scenarios: 1) the use of GPT-3.5 as a “second reviewer”, replacing one of the two researchers usually involved in the sorting; and 2) the use of GPT-3.5 as a sorting tool, applied prior to evaluation by a researcher. The artificial intelligence’s performance was expressed in terms of sensitivity (the ability to correctly identify relevant studies, essential to avoid biasing the results of the systematic review by omitting studies on the subject) and specificity (the ability to keep “only” these relevant studies when sorting).
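The two metrics can be made concrete with a minimal sketch. The data and the helper functions below are hypothetical illustrations (they are not taken from the study): human decisions serve as the reference, and the model's decisions are scored against them.

```python
# Sketch of the two screening metrics used in the study.
# Sensitivity: share of truly relevant studies the model also keeps.
# Specificity: share of irrelevant studies the model correctly excludes.

def sensitivity(human_labels, model_labels):
    """Fraction of studies included by humans that the model also includes."""
    kept = [m for h, m in zip(human_labels, model_labels) if h]
    return sum(kept) / len(kept)

def specificity(human_labels, model_labels):
    """Fraction of studies excluded by humans that the model also excludes."""
    dropped = [not m for h, m in zip(human_labels, model_labels) if not h]
    return sum(dropped) / len(dropped)

# Toy example: True = "include in the review", False = "exclude".
human = [True, True, True, False, False, False, False, False]
model = [True, True, False, False, True, True, False, False]
print(sensitivity(human, model))  # 2 of the 3 relevant studies kept
print(specificity(human, model))  # 3 of the 5 irrelevant studies excluded
```

A low specificity (as observed in the first scenario) means many irrelevant studies survive the AI's sorting and must still be weeded out by hand.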

In the first scenario, GPT-3.5 was used as a “second reviewer” to confirm the choices made by a human researcher. The AI’s sensitivity ranged from 81.1% to 96.5%, comparable to the performance of a trained researcher. Its specificity ranged from 25.8% to 80.4%, below the performance of a trained researcher. This lower specificity would therefore require additional time to verify the sorting performed by the AI.

In the second scenario, GPT-3.5 was used as a “sorting tool” prior to evaluation by a human researcher. The AI had a sensitivity above 94.6%, superior to a single researcher and comparable to two researchers performing the sorting in duplicate and independently. Such use would reduce the number of studies to be sorted for the systematic review by up to 45% (i.e., several thousand fewer studies to evaluate), at the risk of missing at most 3.8% of the studies that two human reviewers would have included. Using AI to reduce researchers’ workload in this way is particularly attractive when there is an urgent need to synthesize the available evidence (as was the case during the COVID-19 pandemic), or when the research field is so large that humans cannot sort all the articles. However, the researchers also showed that the model’s responses varied over time, undermining the reproducibility of systematic reviews.
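The workload figures quoted above can be illustrated with a back-of-the-envelope calculation (a hypothetical sketch using the article's headline numbers, not the study's actual method):

```python
# "Sorting tool" scenario: the model pre-excludes records, and humans only
# review what remains. Figures below are illustrative, taken from the
# numbers quoted in the article (22,000 records, up to 45% reduction).

def remaining_workload(total_records, reduction_rate):
    """Records left for human screening after AI pre-sorting."""
    return round(total_records * (1 - reduction_rate))

print(remaining_workload(22000, 0.45))  # roughly 12,100 records left to screen
```

Even a modest per-record reading time makes the saving tangible: at two minutes per abstract, ~9,900 fewer records corresponds to hundreds of hours of screening.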

This research illustrates the potential of generative AI to facilitate and accelerate scientific data synthesis tasks, paving the way for partial automation of these processes.

  1. Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7(9):e1000326.
  2. Bornmann L, Haunschild R, Mutz R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications. 2021;8(224).
  3. Allen IE, Olkin I. Estimating time to conduct a meta-analysis from number of citations retrieved. JAMA. 1999;282(7):634-5.
