Authors: Unlu, O.; Shin, J.; Mailly, C. J.; Oates, M. F.; Tucci, M. R.; Varugheese, M.; Wagholikar, K.; Wang, F.; Scirica, B. M.; Blood, A. J.; Aronson, S. J.

Score: 62.9, Published: 2024-02-08

DOI: 10.1101/2024.02.08.24302376

Background: Subject screening is a key aspect of all clinical trials; however, it has traditionally been a labor-intensive and error-prone task, demanding significant time and resources. With the advent of large language models (LLMs) and related technologies, a paradigm shift in natural language processing capabilities offers a promising avenue for increasing both the quality and efficiency of screening efforts. This study aimed to test whether a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-trained Transformer 4 (GPT-4) workflow could accurately identify and report on inclusion and exclusion criteria for a clinical trial.

Methods: The Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial aims to recruit patients with symptomatic heart failure. As part of the screening process, a list of potentially eligible patients is created through an electronic health record (EHR) query. Currently, structured data in the EHR can only be used to determine 5 of 6 inclusion and 5 of 17 exclusion criteria. Trained, but non-licensed, study staff complete manual chart review to determine patient eligibility and record their assessment of the inclusion and exclusion criteria. We obtained the structured assessments completed by the study staff and clinical notes from the past two years and developed a clinical note-based question-answering workflow powered by a RAG architecture and GPT-4, which we named RECTIFIER (RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review). We used notes from 100 patients as a development dataset, 282 patients as a validation dataset, and 1894 patients as a test set. An expert clinician completed a blinded review of patient charts to answer the eligibility questions and determine the "gold standard" answers. We calculated the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) for each question and screening method, and performed bootstrapping to calculate confidence intervals for each statistic.

Results: Both RECTIFIER and study staff answers closely aligned with the expert clinician answers across criteria, with accuracy ranging from 97.9% to 100% (MCC 0.837 to 1) for RECTIFIER and from 91.7% to 100% (MCC 0.644 to 1) for study staff. RECTIFIER performed better than study staff in determining the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% vs 91.7% and an MCC of 0.924 vs 0.721, respectively. Overall, the sensitivity and specificity of determining eligibility were 92.3% (CI) and 93.9% (CI) for RECTIFIER and 90.1% (CI) and 83.6% (CI) for study staff, respectively.

Conclusion: GPT-4-based solutions have the potential to improve efficiency and reduce costs in clinical trial screening. When incorporating new tools such as RECTIFIER, it is important to consider the potential hazards of automating the screening process and to set up appropriate mitigation strategies, such as a final clinician review before patient engagement.
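The evaluation above rests on per-criterion agreement statistics (sensitivity, specificity, accuracy, MCC) with bootstrapped confidence intervals. The sketch below shows one way such numbers could be computed; the function names, the 1,000-resample count, and the percentile-bootstrap choice are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of per-criterion agreement metrics with a percentile-bootstrap CI.
# Assumed approach for illustration; not the study's code.
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and MCC for one eligibility question."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for one of the metrics above."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        stats.append(screening_metrics(y_true[idx], y_pred[idx])[metric])
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# toy data: expert ("gold standard") vs. automated answers for one criterion
rng = np.random.default_rng(1)
gold = rng.integers(0, 2, 200)
auto = np.where(rng.random(200) < 0.9, gold, 1 - gold)  # ~90% agreement
print(screening_metrics(gold, auto))
print("accuracy 95% CI:", bootstrap_ci(gold, auto, "accuracy"))
```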

Authors: Ponting, C. P.; Samms, G. L.

Score: 65.3, Published: 2024-02-01

DOI: 10.1101/2024.01.31.24302070

Background: People with Myalgic Encephalomyelitis (ME/CFS; sometimes referred to as chronic fatigue syndrome) experience very poor health-related quality of life and only rarely recover. ME/CFS has no curative treatment and no single diagnostic test. Public health and policy decisions relevant to ME/CFS require knowledge of its prevalence and of barriers to diagnosis. However, people with ME/CFS report lengthy diagnostic delays and widespread misunderstanding of their symptoms. Published prevalence estimates vary greatly by country, gender, age and ethnicity.

Methods: Hospital Episode Statistics data are routinely collected by the NHS in England together with patient age, gender and ethnicity. These data, downloaded from the Feasibility Self-Service of NHS DigiTrials, were used to stratify individuals with the ICD-10 code that best reflects ME/CFS symptoms (G93.3; "Postviral fatigue syndrome") according to their age, self-reported gender and ethnicity, General Practice, and NHS England Integrated Care Board (ICB).

Results: In all, 100,055 people in England had been diagnosed with ME/CFS (ICD-10: G93.3) between April 1, 1989 and October 7, 2023, 0.16% of all registered patients. Of these, 79,445 were female and 20,590 male, a female-to-male ratio of 3.88:1. Female relative to male prevalence peaked at about 6-to-1 among individuals in their fourth and fifth decades of life. Prevalence varied widely across the 42 ICBs: 0.086%-0.82% for females and 0.024%-0.21% for males. White individuals were approximately 5-fold more likely to be diagnosed with ME/CFS than others; black, Asian or Chinese ethnicities were associated with particularly low rates of ME/CFS diagnosis. This ethnicity bias is stronger than for other common diseases. Among active English GP practices, 176 (3%) had no registered ME/CFS patients. Eight ICBs (19%) each contained fewer than 8 other-than-white individuals with a G93.3 code despite their registers containing a total of 293,770 other-than-white patients.

Conclusion: Those who are disproportionately undiagnosed with ME/CFS include other-than-white ethnic groups, older females (>60 y), older males (>80 y), and people living in areas of multiple deprivation. The lifetime prevalence of ME/CFS for English females and males may be as high as 0.92% and 0.25%, respectively, or approximately 390,000 UK individuals overall. This improved estimate of ME/CFS prevalence allows more accurate assessment of the socioeconomic and disease burden imposed by ME/CFS.
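The core calculation here is stratified prevalence: counts of patients carrying the G93.3 code divided by registered-patient counts within each stratum. A minimal sketch is below; the numerators are the gender totals quoted in the abstract, while the registered-patient denominators and column names are placeholder assumptions, not NHS DigiTrials output.

```python
# Minimal sketch of the stratified prevalence calculation described above.
# Numerators are the gender totals quoted in the abstract; the registered-
# patient denominators are placeholder assumptions.
import pandas as pd

counts = pd.DataFrame({
    "gender":     ["female", "male"],
    "g933_coded": [79_445, 20_590],          # patients ever coded ICD-10 G93.3
    "registered": [31_000_000, 30_500_000],  # placeholder denominators
})

counts["prevalence_pct"] = 100 * counts["g933_coded"] / counts["registered"]
print(counts)

# The same division extends to finer strata (age band, ethnicity, GP practice,
# ICB) by adding those columns and summing within groups first, e.g.:
# counts.groupby(["gender", "age_band"])[["g933_coded", "registered"]].sum()
```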

Authors: Shanmugam, P.; Bair, M.; Pendl-Robinson, E.; Hu, X. C.

Score: 8.1, Published: 2024-02-09

DOI: 10.1101/2024.02.08.24302528

With hundreds of millions of COVID-19 infections to date, a considerable portion of the population has developed or will develop long COVID. Understanding the prevalence, risk factors, and healthcare costs of long COVID is therefore of significant societal importance. To investigate the utility of large-scale electronic health record (EHR) data in identifying and predicting long COVID, we analyzed data from the National COVID Cohort Collaborative (N3C), a longitudinal EHR data repository drawing from 65 sites in the US with over 8 million COVID-19 patients. We characterized the prevalence of long COVID using several different definitions to illustrate their relative strengths and weaknesses. We then developed a machine learning model to predict the risk of developing long COVID from demographic factors and comorbidities recorded in the EHR. The risk factors for long COVID include patient age, sex, smoking status, and comorbidities as characterized by the Charlson Comorbidity Index.
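As an illustration of the kind of risk model described above (predicting a long-COVID label from demographics and comorbidity burden), the sketch below fits a logistic regression on synthetic data. The feature set, the synthetic outcome, and the choice of logistic regression are assumptions for demonstration; the abstract does not specify the model class.

```python
# Hypothetical sketch: risk model for long COVID from demographics and a
# comorbidity score. Synthetic data only; not the N3C cohort or the study's model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5_000
X = np.column_stack([
    rng.integers(18, 90, n),   # age
    rng.integers(0, 2, n),     # sex (0/1)
    rng.integers(0, 2, n),     # current/former smoker (0/1)
    rng.poisson(1.5, n),       # Charlson Comorbidity Index
])
# synthetic outcome loosely tied to the features, for demonstration only
logit = -4 + 0.02 * X[:, 0] + 0.3 * X[:, 1] + 0.4 * X[:, 2] + 0.5 * X[:, 3]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```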

Authors: Gravel, J.; Dion, C.; Fadaei Kermani, M.; Mousseau, S.; Osmanlliu, E.

Score: 1.5, Published: 2024-02-11

DOI: 10.1101/2024.02.09.24302591

Background: ChatGPT has received recognition for medical writing. Our objective was to evaluate whether ChatGPT 4.0 could improve the quality of abstracts submitted to a medical conference by clinical researchers.

Methods: This was an experimental study involving 24 international researchers who each provided one original abstract intended for submission to the 2024 Pediatric Academic Society (PAS) conference. We created a prompt asking ChatGPT-4 to improve the quality of the abstract while adhering to PAS submission guidelines. Researchers received the revised version and were tasked with creating a final abstract. The quality of each version (original, ChatGPT, and final) was evaluated by the researchers themselves using a numeric scale (0-100). Additionally, three co-investigators assessed the abstracts while blinded to the version. The primary analysis focused on the mean difference in scores between the final and original abstracts.

Results: Abstract quality varied between the three versions, with mean scores of 82, 65, and 90 for the original, ChatGPT, and final versions, respectively. Overall, the final version displayed significantly improved quality compared with the original (mean difference 8.0 points; 95% CI: 5.6-10.3). Independent ratings by the co-investigators confirmed a statistically significant improvement (mean difference 1.10 points; 95% CI: 0.54-1.66). Researchers identified minor (n=10) and major (n=3) factual errors in ChatGPT's abstracts.

Conclusion: While ChatGPT 4.0 does not produce abstracts of better quality than those crafted by researchers, it serves as a valuable tool for researchers to enhance the quality of their own abstracts. The utilization of such tools is a potential strategy for researchers seeking to improve their abstracts.

Funding: None
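The primary analysis is a paired comparison of final versus original scores, summarized as a mean difference with a 95% confidence interval. A minimal sketch of that calculation follows; the score arrays are fabricated placeholders (the study's actual scores are not reproduced here), and the t-based interval is an assumption about how such a CI might be obtained.

```python
# Paired mean difference with a 95% t-based confidence interval, as a sketch
# of the primary analysis. The scores below are fabricated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original = rng.uniform(60, 95, size=24)                 # one score per researcher
final = np.clip(original + rng.normal(8, 6, size=24), 0, 100)

diff = final - original
mean_diff = diff.mean()
lo, hi = stats.t.interval(0.95, len(diff) - 1, loc=mean_diff, scale=stats.sem(diff))
print(f"mean difference {mean_diff:.1f} points (95% CI {lo:.1f} to {hi:.1f})")
```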