The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement was published in 2015 to provide the minimum reporting recommendations for studies developing or evaluating the performance of a prediction model. Methodological advances in the field of prediction have since included the widespread use of artificial intelligence (AI) powered by machine learning methods to develop prediction models. An update to the TRIPOD statement is thus needed. TRIPOD+AI provides harmonised guidance for reporting prediction model studies, irrespective of whether regression modelling or machine learning methods have been used. The new checklist supersedes the TRIPOD 2015 checklist, which should no longer be used. This article describes the development of TRIPOD+AI and presents the expanded 27 item checklist with more detailed explanation of each reporting recommendation, and the TRIPOD+AI for Abstracts checklist. TRIPOD+AI aims to promote the complete, accurate, and transparent reporting of studies that develop a prediction model or evaluate its performance. Complete reporting will facilitate study appraisal, model evaluation, and model implementation.
Contributors: Gary S Collins, Karel G M Moons, Paula Dhiman, Richard D Riley, Andrew L Beam, Ben Van Calster, Xiaoxuan Liu, Johannes B Reitsma, Maarten van Smeden, Anne-Laure Boulesteix, Jennifer Catherine Camaradou, Leo Anthony Celi, Spiros Denaxas, Alastair K Denniston, Ben Glocker, Robert M Golub, Hugh Harvey, Georg Heinze, Michael M Hoffman, André Pascal Kengne, Emily Lam, Naomi Lee, Elizabeth W Loder, Lena Maier-Hein, Bilal A Mateen, Melissa D McCradden, Lauren Oakden-Rayner, Johan Ordish, Richard Parnell, Sherri Rose, Karandeep Singh, Laure Wynants, Patricia Logullo
Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating a subset of parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Inspired by an effort to investigate the different roles of LoRA matrices during fine-tuning, this paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices. Specifically, when updating the parameter matrices of a neural network by adding a product BA, we observe that the B and A matrices have distinct functions: A extracts features from the input, while B uses these features to create the desired output. Based on this observation, we demonstrate that fine-tuning B is inherently more effective than fine-tuning A, and that a random untrained A should perform nearly as well as a fine-tuned one. Using an information-theoretic lens, we also bound the generalization of low-rank adapters, showing that the parameter savings of exclusively training B improves the bound. We support our conclusions with experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs.
Contributors: Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Mikhail Yurochkin, Justin Solomon
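To make the roles of the two adapter matrices in the LoRA abstract above concrete, here is a minimal, dependency-free sketch that freezes a random A and trains only B on a toy regression problem. It is not the authors' implementation: the dimensions, learning rate, data, and all function names (`forward`, `mse`, `train_b_only`) are assumptions chosen for illustration, and a toy this small only shows the mechanics, not the paper's empirical findings.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(W0, A, B, x):
    """Compute (W0 + B A) x without forming B A; also return the features z = A x."""
    z = matvec(A, x)          # A extracts r features from the input
    base = matvec(W0, x)      # frozen pre-trained weights
    delta = matvec(B, z)      # B turns the features into an output correction
    return [b + d for b, d in zip(base, delta)], z

def mse(W0, A, B, data):
    """Mean over samples of the squared error summed over outputs."""
    total = 0.0
    for x, y in data:
        y_hat, _ = forward(W0, A, B, x)
        total += sum((p - t) ** 2 for p, t in zip(y_hat, y))
    return total / len(data)

def train_b_only(W0, A, data, lr=0.1, steps=300):
    """Full-batch gradient descent on B alone; W0 and A are never updated."""
    d_out, r = len(W0), len(A)
    B = [[0.0] * r for _ in range(d_out)]   # zero init: the adapter starts as a no-op
    for _ in range(steps):
        grad = [[0.0] * r for _ in range(d_out)]
        for x, y in data:
            y_hat, z = forward(W0, A, B, x)
            for i in range(d_out):
                err = y_hat[i] - y[i]
                for j in range(r):
                    grad[i][j] += err * z[j] / len(data)
        for i in range(d_out):
            for j in range(r):
                B[i][j] -= lr * grad[i][j]
    return B

# Toy setup (all values hypothetical).
rng = random.Random(0)
d_in, d_out, r = 6, 4, 2

def rand_matrix(rows, cols, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

W0 = rand_matrix(d_out, d_in, 1.0)
A = rand_matrix(r, d_in, d_in ** -0.5)      # random, frozen, never trained
A_snapshot = [row[:] for row in A]

# Targets come from a different low-rank update, so B must adapt to the frozen A.
B_true, A_true = rand_matrix(d_out, r, 1.0), rand_matrix(r, d_in, 1.0)
xs = [[rng.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(50)]
data = [(x, forward(W0, A_true, B_true, x)[0]) for x in xs]

B_zero = [[0.0] * r for _ in range(d_out)]
loss_before = mse(W0, A, B_zero, data)
B = train_b_only(W0, A, data)
loss_after = mse(W0, A, B, data)
print(f"loss before: {loss_before:.4f}, after training B only: {loss_after:.4f}")
```

Because A is fixed, the update B A can only act on the r directions that A picks out of the input, so the loss drops but generally does not reach zero in a problem this small.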
To the Editor:
Substance use disorders (SUDs) and overdose deaths continue at record levels in the USA. One major barrier to adequate treatment is the stigma attached to the condition. Evidence suggests that clinicians have more negative attitudes and less empathy toward patients with SUDs compared to other medical and mental health conditions, thereby affecting the overall quality of care these patients receive [1]. Stigma can become apparent during clinical interactions where providers may unintentionally convey negative emotions or judgments through their facial expressions.
Until recently, empathy toward this patient population was thought of as an inherent trait that could not be taught. However, studies in the medical literature have shown that medical trainees can improve their empathy toward patients [2]. Given that a physician’s ability to communicate effectively is associated with better patient outcomes, it is imperative to educate future physicians about how stigma manifests in the clinical setting and the importance of empathetic communication.
A promising approach to achieving this goal is affective computing, also called emotional artificial intelligence, which enables computers to recognize, interpret, process, and simulate human emotion. Researchers from the MIT Media Lab at the Massachusetts Institute of Technology and Weill Cornell Medical College have developed Medship, a computerized training module. Medship leverages affective computing to educate future medical providers about the stigma toward patients with SUDs. It offers interactions with virtual (i.e., computerized) patients who have an SUD, records the user during each interaction, and simultaneously analyzes the user’s facial expressions to provide feedback on them in real time. The facial analysis uses OpenFace, a lightweight, open-source toolkit for facial behavior analysis.
This project is being split into two studies. The initial study aimed to evaluate the usability and acceptability of Medship among medical students. Given the multitude of educational options currently available to medical students, their willingness to adopt the application is pivotal to its success. The second part of this project will be a randomized controlled trial to assess the module’s impact on decreasing negative attitudes toward this patient population and will be critical in evaluating its efficacy.
The initial study of this project used a quantitative interventional design, including a cross-sectional survey following a single session of using Medship. The Institutional Review Board at Weill Cornell Medical College granted approval for the study protocol. The online link for Medship was emailed to medical students during the course of their regular education. All feedback was contained in the module. A total of 26 students at Weill Cornell Medical College participated, providing anonymous responses to demographic questions, a System Usability Scale [3], and a System Quality Scale [4]. Usability refers to the ease of using the module, while acceptability gauges students’ willingness to integrate the module into their medical curriculum.
The results from this pilot study demonstrated positive feedback. Regarding usability, all students found it easy to learn and navigate the module. Most students reported that the module was both enjoyable and user-friendly (n = 20; 77%) and found the graphics to be of high quality and resolution (n = 25; 96%). Participants assigned an average of 85 on the System Usability Scale, where a score of 73 or above indicates satisfactory usability. Regarding acceptability, each student believed that their medical institution should offer Medship as part of the educational curriculum, and a substantial portion felt that medical students would greatly benefit from using the module (n = 20; 77%). On the System Quality Scale, participants rated the module an average of 4, where a score of 3 or higher indicates satisfactory acceptability.
One limitation of Medship is its potential lack of cultural diversity in the inputs that it receives to use in its algorithms that analyze facial action units. Expression of empathy in Western culture often assumes a “one-size-fits-all” approach without taking into consideration intercultural contexts. The current version of Medship is limited from a diversity standpoint in terms of the number of inputs it has from users coming from different backgrounds, cultures, and ethnicities. Future iterations of Medship must address this to enhance external validity.
Previous research has revealed that patients with major depression perceive neutral faces as sad, whereas healthy participants interpret them as happy [5]. This raises the question of whether patients with SUDs might exhibit distinct perceptions of neutral faces, particularly in light of the common comorbidity of SUDs and mood disorders. While one approach could be to control for these comorbidities, a more clinically valuable direction could be to develop a unique version of Medship addressing patients with SUDs and specific comorbidities.
SUDs are becoming increasingly prevalent, remain significantly undertreated, and are stigmatized by clinicians more so than other medical and psychiatric illnesses. Affective computing is gaining prominence across industries, and the field of medicine is now exploring both its safety and efficacy in enhancing patient care. Medship has the capability of improving empathetic communication between providers and their patients. The first iteration of this study has revealed positive results in terms of the technology’s usability and acceptability by medical students, and the next portion of this study will focus on assessing Medship’s efficacy as an application.
Contributors: Michael Woods, Giselle Appel, Aidana Daulbayeva, Caleb Harris, Julia Iyasere, Jonathan Avery
AI alignment refers to models acting towards human-intended goals, preferences, or ethical principles. Analyzing the similarity between models and humans can be a proxy measure for ensuring AI safety. In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment. Specifically, we propose a new dataset for measuring AI-human visual alignment in terms of image classification. In order to evaluate AI-human visual alignment, a dataset should encompass samples with various scenarios and have gold human perception labels. Our dataset consists of three groups of samples, namely Must-Act (i.e., Must-Classify), Must-Abstain, and Uncertain, based on the quantity and clarity of visual information in an image and further divided into eight categories. All samples have a gold human perception label; even Uncertain (e.g., severely blurry) sample labels were obtained via crowd-sourcing. The validity of our dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using our dataset, we analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods. Our code and data are available at https://github.com/jiyounglee-0523/VisAlign.
Contributors: Jiyoung Lee, Seungho Kim, Seunghyun Won, Joonseok Lee, James Thorne, Jaeseok Choi, O-Kil Kwon, Edward Choi
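The VisAlign abstract above does not name the seven abstention methods it evaluates. As a rough illustration of what an abstention method does, the sketch below uses one common baseline, thresholding the maximum softmax probability. Whether this baseline is among the seven is not stated here, and the function names, labels, and threshold are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_or_abstain(logits, labels, threshold=0.7):
    """Return the top label, or None (abstain) if the model is not confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None            # abstain: the top class is too uncertain
    return labels[best]

labels = ["cat", "dog", "bird"]                         # hypothetical classes
clear = classify_or_abstain([4.0, 0.5, 0.1], labels)    # one class dominates
blurry = classify_or_abstain([1.0, 0.9, 1.1], labels)   # near-uniform logits
print(clear, blurry)
```

In terms of the dataset's groups, a well-aligned model would return a label on Must-Act samples and `None` on Must-Abstain samples.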
Deployed language models decay over time due to shifting inputs, changing user needs, or emergent world-knowledge gaps. When such problems are identified, we want to make targeted edits while avoiding expensive retraining. However, current model editors, which modify such behaviors of pre-trained models, degrade model performance quickly across multiple, sequential edits. We propose GRACE, a lifelong model editing method, which implements spot-fixes on streaming errors of a deployed model, ensuring minimal impact on unrelated inputs. GRACE writes new mappings into a pre-trained model's latent space, creating a discrete, local codebook of edits without altering model weights. This is the first method enabling thousands of sequential edits using only streaming errors. Our experiments on T5, BERT, and GPT models show GRACE's state-of-the-art performance in making and retaining edits, while generalizing to unseen inputs. Our code is available at github.com/thartvigsen/grace.
Contributors: Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim
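The idea of a discrete, local codebook that leaves model weights untouched, as described in the GRACE abstract above, can be sketched as a wrapper around a frozen layer. This is an illustrative toy, not the code from the GRACE repository: the class name `EditCodebook`, the fixed Euclidean `radius`, and the stand-in `frozen_layer` are all assumptions.

```python
import math

class EditCodebook:
    """Wrap a frozen layer with key-value edits that only fire near their keys."""

    def __init__(self, layer, radius=0.5):
        self.layer = layer      # original layer, never modified
        self.radius = radius    # assumed locality radius around each key
        self.keys = []          # latent activations where an edit applies
        self.values = []        # replacement outputs for those activations

    def add_edit(self, key, value):
        """Record one spot-fix: inputs near `key` should produce `value`."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def __call__(self, h):
        if self.keys:
            dists = [math.dist(h, k) for k in self.keys]
            nearest = min(range(len(dists)), key=dists.__getitem__)
            if dists[nearest] <= self.radius:
                return self.values[nearest]   # inside an edit's neighborhood
        return self.layer(h)                   # unrelated input: original behavior

def frozen_layer(h):
    """Stand-in for a pre-trained layer (hypothetical)."""
    return [2.0 * v for v in h]

codebook = EditCodebook(frozen_layer, radius=0.5)
codebook.add_edit(key=[1.0, 1.0], value=[9.0, 9.0])

edited = codebook([1.1, 0.9])      # close to the key, so the edit applies
untouched = codebook([3.0, 3.0])   # far from every key, so the layer runs as before
print(edited, untouched)
```

Keeping edits in a lookup structure rather than in the weights is what lets unrelated inputs pass through unchanged, since an edit cannot affect any input outside its radius.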
When training predictive models on data with missing entries, the most widely used and versatile approach is a pipeline technique where we first impute missing entries and then compute predictions. In this paper, we view prediction with missing data as a two-stage adaptive optimization problem and propose a new class of models, adaptive linear regression models, where the regression coefficients adapt to the set of observed features. We show that some adaptive linear regression models are equivalent to learning an imputation rule and a downstream linear regression model simultaneously instead of sequentially. We leverage this joint-impute-then-regress interpretation to generalize our framework to non-linear models. In settings where data is strongly not missing at random, our methods achieve a 2–10% improvement in out-of-sample accuracy.
Contributors: Arthur Delarue, Jean Pauphilet
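The equivalence mentioned in the abstract above can be seen in a small worked example. Assuming a linear imputation rule (each missing feature is filled with a weighted sum of the observed ones), imputing and then applying a linear model gives the same prediction as a linear model whose coefficients depend on which features are missing. The matrices `w` and `M` and all function names are hypothetical, and the sketch only checks the algebra at prediction time. It does not learn the two stages jointly, as the paper proposes.

```python
def impute_then_regress(x, missing, w, M):
    """Fill each missing x[j] with sum_k M[j][k] * x[k] over observed k, then apply w."""
    observed = [k for k in range(len(x)) if not missing[k]]
    filled = list(x)
    for j in range(len(x)):
        if missing[j]:
            filled[j] = sum(M[j][k] * x[k] for k in observed)
    return sum(wj * xj for wj, xj in zip(w, filled))

def adaptive_coefficients(missing, w, M):
    """Coefficients on observed features implied by the pipeline; zero where missing."""
    n = len(w)
    coefs = [0.0] * n
    for k in range(n):
        if not missing[k]:
            # Each missing feature j passes its weight w[j] to observed k via M[j][k].
            coefs[k] = w[k] + sum(w[j] * M[j][k] for j in range(n) if missing[j])
    return coefs

def adaptive_predict(x, missing, w, M):
    """Linear prediction using only observed entries and pattern-dependent coefficients."""
    coefs = adaptive_coefficients(missing, w, M)
    return sum(c * xk for c, xk, m in zip(coefs, x, missing) if not m)

w = [1.0, 2.0, -1.0]                  # downstream regression weights (hypothetical)
M = [[0.0, 0.5, 0.2],                 # linear imputation rule (hypothetical)
     [0.3, 0.0, 0.1],
     [0.4, 0.6, 0.0]]
x = [1.0, None, 3.0]                  # second feature is missing
missing = [False, True, False]

pipeline = impute_then_regress(x, missing, w, M)
adaptive = adaptive_predict(x, missing, w, M)
print(pipeline, adaptive)             # both are approximately -0.8
```

In this example the coefficient on the first feature shifts from 1.0 to 1.6 when the second feature is missing, which is one concrete sense in which regression coefficients adapt to the set of observed features.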
Numerical simulations can model the physical processes that govern cardiovascular device deployment. When such simulations incorporate digital twins, computational models of patient-specific anatomy, they can expedite and de-risk the device design process. Nonetheless, the exclusive use of patient-specific data constrains the anatomic variability that can be precisely or fully explored. In this study, we investigate the capacity of Latent Diffusion Models (LDMs) to edit digital twins to create anatomic variants, which we term digital siblings. Digital twins and their corresponding siblings can serve as the basis for comparative simulations, enabling the study of how subtle anatomic variations impact the simulated deployment of cardiovascular devices, as well as the augmentation of virtual cohorts for device assessment. However, while diffusion models have been characterized in their ability to edit natural images, their capacity to anatomically edit digital twins has yet to be studied. Using a case example centered on 3D digital twins of cardiac anatomy, we implement various methods for generating digital siblings and characterize them through morphological and topological analyses. We specifically edit digital twins to introduce anatomic variation at different spatial scales and within localized regions, demonstrating the existence of bias towards common anatomic features. We further show that such anatomic bias can be leveraged for virtual cohort augmentation through selective editing, partially alleviating issues related to dataset imbalance and lack of diversity. Our experimental framework thus delineates the limits and capabilities of using latent diffusion models in synthesizing anatomic variation for in silico trials.
Contributors: Karim Kadry, Shreya Gupta, Farhad R. Nezami
Joe, who has received a diagnosis of major depressive disorder, is meeting every 2 weeks with his psychiatrist, Sandy.
“How have you been feeling since we last met?” asks Sandy.
“Much better,” says Joe. “I’ve been much more active and social, and I’m sleeping great!”
“That’s wonderful,” says Sandy. “But…I think your wearable must be broken. The data from it looks very irregular for your sleep these past 2 weeks.”
“Oh,” says Joe, “it’s not broken. Actually, now that you mention it, my sleep has been really messed up. I slept well only yesterday.”
“Well,” asks Sandy, “should we talk more about how we can improve your sleep?”
This conversation is based on a real patient–therapist interaction. In this case, the data from wearable technology served as a prompt to obtain details of the patient’s life that might have otherwise been missed. Traditional clinical assessments depend on patient recall. Although such recall can include important factors that wearable technology (often termed “wearables”) does not detect, such as patients’ reports of distress, the assessments by wearables of longitudinal data from daily life may augment methods of monitoring and treating depression, providing objective complements to subjective information from patients.
Contributors: Szymon Fedor, Robert Lewis, Paola Pedrelli, David Mischoulon, Joshua Curtiss
People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
Contributors: Hussein Mozannar, Jimin J Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das
Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned on specific tasks, have recently achieved unparalleled success on a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we explore the relative performance of pre-trained models compared to a simple baseline, L1-regularized logistic regression, including in the few-shot setting. We perform ablation studies to understand whether pre-training improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyperparameter settings and parameter initializations. Taken together, our results highlight the importance of rigorously testing foundation models against well-established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model in order to more effectively harness these models to transform single-cell data analysis. Code is available at https://github.com/clinicalml/sc-foundation-eval.