bioinformatics

Authors: Stringer, C.; Pachitariu, M.

Score: 106.0, Published: 2024-02-12

DOI: 10.1101/2024.02.10.579780

Generalist methods for cellular segmentation have good out-of-the-box performance on a variety of image types. However, existing methods struggle for images that are degraded by noise, blurred or undersampled, all of which are common in microscopy. We focused the development of Cellpose3 on addressing these cases, and here we demonstrate substantial out-of-the-box gains in segmentation and image quality for noisy, blurry or undersampled images. Unlike previous approaches, which train models to restore pixel values, we trained Cellpose3 to output images that are well-segmented by a generalist segmentation model, while maintaining perceptual similarity to the target images. Furthermore, we trained the restoration models on a large, varied collection of datasets, thus ensuring good generalization to user images. We provide these tools as "one-click" buttons inside the graphical interface of Cellpose as well as in the Cellpose API.

Authors: Chatterjee, S.; Mahata, J.; Kateriya, S.; Anirudhan, G.

Score: 85.4, Published: 2024-02-12

DOI: 10.1101/2024.02.11.578752

The influence of SARS-CoV-2 non-structural protein on the host's tissue-specific complexities remains a mystery and needs more in-depth attention because of COVID-19 recurrence and long COVID. Here we investigated in silico the influence of the SARS-CoV-2 transmembrane protein NSP6 (non-structural protein 6) in three major organs: the brain, heart, and lung. To elucidate the interplay between NSP6 and host proteins, we analyzed the protein-protein interaction network of proteins interacting with NSP6-interacting proteins. Reported host interacting partners of NSP6 were ATP5MG, ATP6AP1, ATP13A3, and SIGMAR1. Pathway enrichment analyses provided global insights into biological pathways governed by differentially regulated genes in the three tissues after COVID-19 infection. Hub genes of the tissue-specific protein interactomes were analysed for drug targets and many were found. A miRNA-gene network for the tissue-specific regulated proteins was sought. Comparing this list with the list of genes targeted by SARS-CoV-2 regulated miRNAs, we found three and two common genes in the brain and lung respectively. Among the five common proteins revealed as potential therapeutic targets across the three tissues, four non-approved drugs and one approved drug could target Galectin 3 (LGALS3) and AIFM1 respectively. Increased expression of LGALS3 (which was upregulated in the heart after COVID-19 infection) is observed in multiple cancers and acts as a modulator of tumor progression. COVID-19 infection also causes myocardial inflammation and heart failure (HF), and HF has been observed to increase cancer incidence. The present scenario of long COVID-19 and recurrent COVID-19 infections warrants in-depth studies to probe the effect of COVID-19 infection on increased cancer incidence.

Authors: Li, T.

Score: 26.8, Published: 2024-02-13

DOI: 10.1101/2024.02.12.579906

Metabolic pathways are fundamental maps in biochemistry that detail how molecules are transformed through various reactions. Metabolomics refers to the large-scale study of small molecules. High-throughput, untargeted, mass spectrometry-based metabolomics experiments typically depend on libraries for structural annotation, which is necessary for pathway analysis. However, only a small fraction of spectra can be matched to known structures in these libraries, and only a portion of annotated metabolites can be associated with specific pathways, considering that numerous pathways are yet to be discovered. The complexity of metabolic pathways, where a single compound can play a part in multiple pathways, poses an additional challenge. This study introduces a different concept: the mass-weighted intensity distribution, which is the empirical distribution of the intensities multiplied by their associated m/z values. Analysis of COVID-19 and mouse brain datasets shows that by estimating the differences between the point estimates of these distributions, it becomes possible to infer metabolic directions and magnitudes without requiring knowledge of the exact chemical structures of the compounds or their related pathways. The overall metabolic momentum map, named the momentome, has the potential to bypass the current bottleneck and provide fresh insights into metabolomics studies. This brief report thus provides a mathematical framing for a classic biological concept.
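As a minimal sketch of the concept described in this abstract, the following Python multiplies each feature's intensity by its m/z value and compares a point estimate (mean or median) of the resulting distributions between two conditions. The function names, the toy (m/z, intensity) pairs, and the choice of estimators are illustrative assumptions, not the author's implementation.

```python
from statistics import mean, median

def mass_weighted_intensities(features):
    """Return intensity * m/z for each (mz, intensity) pair."""
    return [mz * intensity for mz, intensity in features]

def location_shift(control, case, estimator=mean):
    """Difference of a point estimate between two mass-weighted distributions.

    A positive value means the case distribution sits at higher
    mass-weighted intensity than the control distribution.
    """
    return (estimator(mass_weighted_intensities(case))
            - estimator(mass_weighted_intensities(control)))

# Hypothetical (m/z, intensity) pairs for the same features in two conditions.
control = [(100.0, 2.0), (200.0, 1.0), (300.0, 1.0)]
case = [(100.0, 1.0), (200.0, 1.0), (300.0, 2.0)]

shift_mean = location_shift(control, case, mean)
shift_median = location_shift(control, case, median)
```

In this toy input, intensity moves from the lightest to the heaviest feature, so the mean shifts upward while the median does not change. This illustrates why the choice of point estimator matters when summarizing such distributions.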

Authors: Wang, C.; Qu, Y.; Peng, Z.; Wang, Y.; Zhu, H.; Chen, D.; Cao, L.

Score: 24.6, Published: 2024-02-15

DOI: 10.1101/2024.02.10.579791

The development of de novo protein design methods is crucial for widespread applications in biology and chemistry. Protein backbone diffusion aims to generate designable protein structures with high efficiency. Although there have been great strides in protein structure prediction, applying these methods to protein diffusion has been challenging and inefficient. We introduce Proteus, an innovative approach that uses graph-based triangle methods and a multi-track interaction network. Proteus demonstrated cutting-edge designability and efficiency in computational evaluations. We tested the reliability of the model by experimental characterization. Our analysis indicates that, from both computational and experimental perspectives, it is able to generate proteins with a remarkably high success rate. We believe Proteus's capacity to rapidly create highly designable protein backbones without the need for pre-training techniques will greatly enhance our understanding of protein structure diffusion and contribute to advances in protein design.

Authors: Frazer, S. A.; Baghbanzadeh, M.; Rahnavard, A.; Crandall, K. A.; Oakley, T. H.

Score: 21.6, Published: 2024-02-14

DOI: 10.1101/2024.02.12.579993

Background: Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax (the wavelength of maximum absorbance), which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype. Results: Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show that regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites. Conclusion: The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism's ecological niche, and may be used more broadly for de novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
Key Points: (1) We introduce the Visual Physiology Opsin Database (VPOD_1.0), which includes 864 unique animal opsin genotypes and corresponding λmax phenotypes from 73 separate publications. (2) We demonstrate that regression-based ML models can reliably predict λmax from gene sequence alone, predict non-additive effects of mutations on function, and identify functionally critical amino acid sites. (3) We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.
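To make the idea of predicting a phenotype from sequence more concrete, here is a minimal sketch of one common preprocessing step for regression models: one-hot encoding the variable columns of an alignment. The function name, the feature naming scheme, and the toy sequences are assumptions for illustration. This is not a description of how deepBreaks or the VPOD pipeline encodes sequences.

```python
def one_hot_columns(aligned_seqs):
    """Map aligned sequences to binary features named '<position><residue>'.

    Only variable alignment columns are encoded, since invariant sites
    cannot help a model distinguish phenotypes.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    features = []  # (zero-based position, residue) pairs
    for pos in range(length):
        residues = sorted({s[pos] for s in aligned_seqs})
        if len(residues) > 1:
            features.extend((pos, r) for r in residues)
    names = [f"{pos + 1}{res}" for pos, res in features]
    rows = [[1 if s[pos] == res else 0 for pos, res in features]
            for s in aligned_seqs]
    return names, rows

# Toy aligned fragments, invented for illustration only.
seqs = ["MAYS", "MAFS", "MTYS"]
names, X = one_hot_columns(seqs)
```

Each row of `X` could then be paired with a measured λmax value and passed to any regression model. In a linear model, the fitted coefficients of features such as `3F` would give a rough indication of which sites matter for the phenotype.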

Authors: Coombe, L.; Kazemi, P.; Wong, J.; Birol, I.; Warren, R. L.

Score: 24.8, Published: 2024-02-13

DOI: 10.1101/2024.02.07.579356

In recent years, the landscape of reference-grade genome assemblies has seen substantial diversification. With such rich data, there is pressing demand for robust tools for scalable, multi-species comparative genomics analyses, including detecting genome synteny, which informs on the sequence conservation between genomes and contributes crucial insights into species evolution. Here, we introduce ntSynt, a scalable utility for computing large-scale multi-genome synteny blocks using a minimizer graph-based approach. Through extensive testing utilizing multiple ~3 Gbp genomes, we demonstrate how ntSynt produces synteny blocks with coverages between 79-100% in at most 2h using 34 GB of memory, even for genomes with appreciable (>15%) sequence divergence. Compared to existing state-of-the-art methodologies, ntSynt offers enhanced flexibility to diverse input genome sequences and synteny block granularity. We expect the macrosyntenic genome analyses facilitated by ntSynt will have broad utility in generating critical evolutionary insights within and between species across the tree of life.
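For readers unfamiliar with minimizers, the sketch below shows the generic technique: in every window of w consecutive k-mers, keep the smallest one, then look for minimizers shared between sequences as candidate anchors. This is a simplified illustration. Lexicographic ordering is used here, whereas real tools typically order k-mers by hash value. The function names are hypothetical, and nothing here reflects ntSynt's actual graph construction.

```python
def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) minimizers of seq.

    In each window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties).
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Tuples compare by k-mer first, then by position, so ties go left.
        kmer, pos = min(kmers[start:start + w])
        selected.add((pos, kmer))
    return sorted(selected)

def shared_minimizers(seq_a, seq_b, k, w):
    """K-mers chosen as minimizers in both sequences (candidate anchors)."""
    a = {kmer for _, kmer in minimizers(seq_a, k, w)}
    b = {kmer for _, kmer in minimizers(seq_b, k, w)}
    return a & b

mins = minimizers("ACGTACGT", k=3, w=2)
anchors = shared_minimizers("ACGTACGT", "TTACGTTT", k=3, w=2)
```

Because only a subset of k-mers is retained, the sketch of each genome is much smaller than its full k-mer set, which is what makes minimizer-based comparison attractive for multi-gigabase genomes.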

Authors: Madrigal, G.; Minhas, B. F.; Catchen, J. M.

Score: 18.1, Published: 2024-02-15

DOI: 10.1101/2024.02.14.580330

The improvement and decreasing costs of third-generation sequencing technologies have widened the scope of biological questions researchers can address with de novo genome assemblies. With the increasing number of reference genomes, validating their integrity with minimal overhead is vital for establishing confident results in their applications. Here, we present Klumpy, a tool for detecting and visualizing both misassembled regions in a genome assembly and genetic elements (e.g., genes) of interest in a set of sequences. By leveraging the initial raw reads in combination with their respective genome assembly, we illustrate Klumpy's utility by investigating antifreeze glycoprotein (afgp) loci across two icefishes, by searching for a reportedly absent gene in the northern snakehead fish, and by scanning the reference genomes of a mudskipper and bumblebee for misassembled regions. In the two former cases, we were able to provide support for the noncanonical placement of an afgp locus in the icefishes and locate the missing snakehead gene. Furthermore, our genome scans were able to identify a cryptic locus in the mudskipper reference genome, and identify a putative repetitive element shared amongst several species of bees.

Authors: Goni, E.; Mas, A. M.; Abad, A.; Santisteban, M.; Fortes, P.; Huarte, M.; Hernaez, M.

Score: 19.0, Published: 2024-02-12

DOI: 10.1101/2024.01.26.577344

Long non-coding RNAs (lncRNAs) play fundamental roles in cellular processes and pathologies, regulating gene expression at multiple levels. Despite being highly cell type-specific, their study at the single-cell (sc) level has been challenging due to their less accurate annotation and low expression compared to protein-coding genes. To identify the important, albeit widely overlooked, specific lncRNAs from scRNA-seq data, we developed a computational framework, ELATUS, based on the pseudoaligner Kallisto, that enhances the detection of previously undetected functional lncRNAs and exhibits higher concordance with the ATAC-seq profiles in single-cell multiome data. Importantly, we then independently confirmed the expression patterns of cell type-specific lncRNAs exclusively detected with ELATUS and unveiled biologically important lncRNAs, such as AL121895.1, a previously undocumented cis-repressor lncRNA, whose role in breast cancer progression was unnoticed by traditional methodologies. Our results emphasize the necessity for an alternative scRNA-seq workflow tailored to lncRNAs that sheds light on the multifaceted roles of lncRNAs.

Authors: Vilov, S.; Heinig, M.

Score: 13.4, Published: 2024-02-12

DOI: 10.1101/2024.02.09.579631

Foundation models, such as DNABERT and Nucleotide Transformer, have recently shaped a new direction in DNA research. Trained in an unsupervised manner on a vast quantity of genomic data, they can be used for a variety of downstream tasks, such as promoter prediction, DNA methylation prediction, gene network prediction or functional variant prioritization. However, these models are often trained and evaluated on entire genomes, neglecting genome partitioning into different functional regions. In our study, we investigate the efficacy of various unsupervised approaches, including genome-wide and 3'UTR-specific foundation models, on human 3'UTR regions. Our evaluation includes downstream tasks specific for RNA biology, such as recognition of binding motifs of RNA binding proteins, detection of functional genetic variants, prediction of expression levels in massively parallel reporter assays, and estimation of mRNA half-life. Remarkably, models specifically trained on 3'UTR sequences demonstrate superior performance when compared to the established genome-wide foundation models in three out of four downstream tasks. Our results underscore the importance of considering genome partitioning into functional regions when training and evaluating foundation models.

Authors: Kim, J.; Ionita, M.; Lee, M.; McKeague, M. L.; Pattekar, A.; Painter, M. M.; Wagenaar, J.; Truong, V.; Norton, D. T.; Mathew, D.; Nam, Y.; Apostolidis, S. A.; Clendenin, C.; Orzechowski, P.; Jung, S.-H.; Woerner, J.; Ittner, C. A. G.; Turner, A. P.; Esperanza, M.; Dunn, T. G.; Mangalmurti, N. S.; Reilly, J. P.; Meyer, N. J.; Calfee, C. S.; Liu, K. D.; Matthy, M. A.; Swigart, L. B.; Burnham, E. L.; McKeehan, J.; Gandotra, S.; Russel, D. W.; Gibbs, K. W.; Thomas, K. W.; Barot, H.; Greenplate, A. R.; Wherry, E. J.; Kim, D.

Score: 12.3, Published: 2024-02-14

DOI: 10.1101/2024.02.13.580114

High-throughput single-cell cytometry data are crucial for understanding the immune system's involvement in diseases and responses to treatment. Traditional methods for annotating cytometry data, specifically manual gating and clustering, face challenges in scalability, robustness, and accuracy. In this study, we propose a single-cell masked autoencoder (scMAE), which offers an automated solution for immunophenotyping tasks including cell type annotation. The scMAE model is designed to uphold user-defined cell type definitions, thereby facilitating easier interpretation and cross-study comparisons. The scMAE model operates on a pre-train and fine-tune approach. In the pre-training phase, scMAE employs Masked Single-cell Modelling (MScM) to learn relationships between protein markers in immune cells solely based on protein expression, without relying on prior information such as cell identity and cell type-specific marker proteins. Subsequently, the pre-trained scMAE is fine-tuned on multiple specialized tasks via task-specific supervised learning. The pre-trained scMAE addresses the shortcomings of manual gating and clustering methods by providing accurate and interpretable predictions. Through validation across multiple cohorts, we demonstrate that scMAE effectively identifies co-occurrence patterns of bound labeled antibodies, delivers accurate and interpretable cellular immunophenotyping, and improves the prediction of subject metadata status. Specifically, we evaluated scMAE for cell type annotation and imputation at the cellular level, and for SARS-CoV-2 infection prediction, secondary immune response prediction against COVID-19, and prediction of the infection stage in COVID-19 progression at the subject level. The introduction of scMAE marks a significant step forward in immunology research, particularly in large-scale and high-throughput human immune profiling. It offers new possibilities for predicting and interpreting cellular-level and subject-level phenotypes in both health and disease.
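The masked-modelling objective mentioned in this abstract can be illustrated with a small sketch: hide a random subset of a cell's marker values, ask a model to reconstruct them, and score the reconstruction only on the hidden positions. Everything below is an assumption made for illustration: the function names, the mask ratio, the zero placeholder, the toy expression vector, and the trivial stand-in predictor. It does not reproduce scMAE's architecture or training procedure.

```python
import random

def mask_markers(expression, mask_ratio, rng):
    """Hide a random subset of markers; return (masked_vector, masked_indices)."""
    n_masked = max(1, round(mask_ratio * len(expression)))
    masked_idx = sorted(rng.sample(range(len(expression)), n_masked))
    masked = list(expression)
    for i in masked_idx:
        masked[i] = 0.0  # placeholder value standing in for a mask token
    return masked, masked_idx

def masked_mse(target, prediction, masked_idx):
    """Mean squared error computed only over the masked positions."""
    return sum((target[i] - prediction[i]) ** 2 for i in masked_idx) / len(masked_idx)

rng = random.Random(0)
cell = [0.2, 1.5, 0.0, 3.1, 0.7, 2.2]  # hypothetical marker expression for one cell
masked_cell, idx = mask_markers(cell, mask_ratio=0.5, rng=rng)

# Trivial stand-in "model": predict the mean of the visible markers everywhere.
visible = [v for i, v in enumerate(cell) if i not in idx]
prediction = [sum(visible) / len(visible)] * len(cell)
loss = masked_mse(cell, prediction, idx)
```

Restricting the loss to masked positions forces a model to infer hidden markers from the visible ones, which is how a masked objective can capture relationships between markers without cell type labels.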

Cellpose3: one-click image restoration for improved cellular segmentation

Pathways in the brain, heart, and lung influenced by SARS-CoV-2 NSP6 and SARS-CoV-2 regulated miRNAs: an in silico study hinting cancer incidence

Infer metabolic directions and magnitudes from moment differences of mass-weighted intensity distributions

Proteus: exploring protein structure generation for enhanced designability and efficiency

Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD)

Multi-genome synteny detection using minimizer graph mappings

Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs

Uncovering functional lncRNAs by scRNA-seq with ELATUS

Investigating the performance of foundation models on human 3'UTR sequences

Single-cell Masked Autoencoder: An Accurate and Interpretable Automated Immunophenotyper