Caroline Uhler is the Andrew (1956) and Erna Viterbi Professor of Engineering at MIT, a professor of electrical engineering and computer science in the Institute for Data, Systems, and Society (IDSS), and director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, where she is also a core institute member and a member of the scientific leadership team.
Uhler is interested in the many ways scientists can discover causality in biological systems, ranging from causal discovery over observed variables to causal feature learning and representation learning. In this interview, she discusses machine learning in biology, promising problem areas, and cutting-edge research at the Schmidt Center.
Q: The Eric and Wendy Schmidt Center has four areas of focus, organized around four natural levels of biological organization: proteins, cells, tissues, and organisms. Given the current landscape of machine learning, is now the right time to work on these particular problem categories?
A: Biology and medicine are currently undergoing a “data revolution”. From genomics and multi-omics to large-scale high-resolution imaging and electronic health records, the availability of diverse datasets makes this the right time. Cheap and accurate DNA sequencing is a reality, molecular imaging has become routine, and single-cell genomics allows the analysis of millions of cells. These innovations and the vast amounts of data they produce have brought us to the threshold of a new era of biology, one in which we can go beyond characterizing the units of life (such as all proteins, genes, and cell types) to understanding the “programs of life”: the logic of the gene circuits and cellular mechanisms by which these units are shaped and interact.
Meanwhile, over the past decade, machine learning has made remarkable progress: models such as BERT, GPT-3, and ChatGPT demonstrate advanced capabilities in text comprehension and generation, while Vision Transformers and multimodal models such as CLIP have achieved human-level performance on image-related tasks. These breakthroughs provide powerful architectural blueprints and training strategies that can be adapted to biological data. For example, a transformer can model genomic sequences much like language, while a vision model can analyze medical and microscopy images.
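To make the "genomic sequences as language" analogy concrete, here is a minimal, hypothetical sketch (not any specific published model) of the usual first step: splitting a DNA sequence into overlapping k-mer "words" and mapping them to integer ids, exactly as a text tokenizer prepares input for a transformer's embedding layer.

```python
def kmer_tokenize(seq, k=3):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(tokens):
    """Map each distinct k-mer to an integer id, as a text tokenizer would."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

seq = "ATGCGTACGT"
tokens = kmer_tokenize(seq)          # ['ATG', 'TGC', 'GCG', ...]
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]     # integer ids ready for an embedding layer
```

From here on, the pipeline is identical to language modeling: the ids feed an embedding table and a stack of attention layers, which is why architectures developed for text transfer so directly to genomics.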
Importantly, biology stands not only to benefit from machine learning but also to be an important source of inspiration for new ML research. Just as agriculture and breeding stimulated modern statistics, biology has the potential to inspire new and even deeper avenues of ML research. Unlike fields such as recommender systems and Internet advertising, where there are no natural laws to discover and predictive accuracy is the ultimate measure of value, in biology phenomena are physically interpretable and causal mechanisms are the ultimate goal. Furthermore, biology has genetic and chemical tools that allow perturbations at a scale unrivalled by other fields. Together, these features make biology uniquely positioned both to benefit greatly from ML and to serve as a profound source of inspiration for it.
Q: Taking a different tack, which questions in biology are still truly resistant to our current tool set? Are there specific challenges in disease or health that you face, and which problem areas do you think are ripe for progress?
A: Machine learning has achieved great success in prediction tasks across domains such as image classification, natural language processing, and clinical risk modeling. In the biological sciences, however, predictive accuracy is often insufficient. The fundamental questions in these fields are essentially causal: How does perturbing a particular gene or pathway affect downstream cellular processes? By what mechanism does an intervention lead to phenotypic changes? Traditional machine learning models mainly capture statistical associations in observational data and usually cannot answer such interventional queries. This strong demand from biology and medicine has in turn inspired new fundamental developments in machine learning.
The field is now equipped with high-throughput perturbation technologies, such as pooled CRISPR screens, single-cell transcriptomics, and spatial profiling, that generate rich datasets under systematic interventions. These data naturally call for models that go beyond pattern recognition to support causal reasoning, active experimental design, and representation learning with complex, structured latent variables. Mathematically, this requires solving core questions of identifiability and sample efficiency, and integrating combinatorial, geometric, and probabilistic tools. I believe solving these challenges will not only unlock new insights into the mechanisms of cellular systems but also push the theoretical boundaries of machine learning.
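The gap between association and intervention described above can be shown in a few lines. The following toy structural causal model is purely illustrative (the variables and coefficients are made up): a confounder Z drives both a gene's activity X and a phenotype Y, while X also directly affects Y with true causal effect 0.5. Regression on observational data overstates the effect; simulating the intervention do(X=x), as a CRISPR perturbation does physically, recovers it.

```python
import random

# Toy SCM (illustrative): Z ~ N(0,1);  X = Z + noise;  Y = 0.5*X + Z + noise.
# The true causal effect of X on Y is 0.5; Z confounds the observed association.

def sample(n, do_x=None, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        # An intervention overrides X's natural mechanism, cutting the Z -> X edge.
        x = do_x if do_x is not None else z + rng.gauss(0, 0.1)
        y = 0.5 * x + z + rng.gauss(0, 0.1)
        data.append((x, y))
    return data

def slope(data):
    """Ordinary least-squares slope of y on x."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cov = sum((x - mx) * (y - my) for x, y in data)
    var = sum((x - mx) ** 2 for x, _ in data)
    return cov / var

n = 10_000
obs_effect = slope(sample(n))  # observational association, inflated to ~1.5 by Z
causal_effect = (sum(y for _, y in sample(n, do_x=1.0)) -
                 sum(y for _, y in sample(n, do_x=0.0))) / n  # recovers ~0.5
```

This is exactly why perturbation screens are so valuable as training data: they sample from the interventional distributions that purely observational datasets cannot identify without further assumptions.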
Regarding foundation models, the consensus in the field is that we do not yet have a holistic foundation model for biology across scales, analogous to what ChatGPT represents for language: a “digital organism” that could simulate all biological phenomena. Although new foundation models appear almost weekly, so far they target specific scales and questions and focus on one or a few modalities.
Significant progress has been made in predicting a protein’s structure from its sequence. This success underlines the importance of recurring machine learning challenges such as CASP (Critical Assessment of protein Structure Prediction), which benchmark state-of-the-art algorithms for protein structure prediction against one another.
The Schmidt Center is organizing challenges to raise awareness in the ML field and to drive progress on the causal prediction problems that are crucial to the biomedical sciences. As single-gene perturbation data at the single-cell level grows, I believe that predicting the effects of single or combined perturbations, and identifying which perturbations could drive a desired phenotype, is becoming a solvable problem. Through our Cell Perturbation Prediction Challenge (CPPC), we aim to provide a way to objectively test and benchmark algorithms for predicting the effects of new perturbations.
Another area where the field has made significant strides is disease diagnosis and patient stratification. Machine learning algorithms can integrate different sources of patient information (data modalities), impute missing modalities, identify patterns that may be difficult for humans to detect, and help stratify patients by disease risk. While we must be cautious about potential biases in model predictions, about the danger of models learning shortcuts rather than true correlations, and about the risk of automation bias in clinical decision-making, I believe this is an area where machine learning has already had a significant impact.
Q: Let’s talk about some of the recent headlines coming out of the Schmidt Center. Which results do you think people should be particularly excited about, and why?
A: In collaboration with Fei Chen’s lab at the Broad Institute, we recently developed a method, called PUPS, to predict the subcellular localization of unseen proteins. Many existing methods can only make predictions for the specific proteins and cell types they were trained on. PUPS, however, combines a protein language model with an image model to utilize both protein sequences and cellular images. We showed that the protein sequence input enables generalization to unseen proteins, while the cellular image input captures single-cell variability, enabling cell-type-specific predictions. The model learns how much each amino acid residue contributes to the predicted subcellular localization and can predict localization changes caused by mutations in a protein’s sequence. Since a protein’s function is closely tied to its subcellular localization, our predictions could provide insights into underlying disease mechanisms. In the future, we aim to extend this approach to predict the localization of multiple proteins in a cell and possibly capture protein-protein interactions.
Together with Professor G.V. Shivashankar of ETH Zürich, a longtime collaborator, we previously showed that images of cells stained with fluorescent DNA-intercalating dyes, which label chromatin, carry a wealth of information about health and disease states. Recently, we demonstrated a deep connection between chromatin organization and gene regulation by developing Image2Reg, a method that can predict unseen genetic or chemical perturbations from chromatin images. Image2Reg uses convolutional neural networks to learn an informative representation of the chromatin images of perturbed cells. It also employs graph convolutional networks to create a gene embedding that captures each gene’s effects, based on protein-protein interaction data integrated with cell-type-specific transcriptomic data. Finally, it learns a mapping between the physical (image-based) and biochemical (gene-based) representations of cells, allowing us to predict the perturbed gene module from chromatin images.
Furthermore, we recently finalized a method to predict the outcomes of unseen combinatorial gene perturbations and to identify the types of interactions occurring between the perturbed genes. The model can guide the design of maximally informative perturbation experiments. In addition, through its attention-based framework, we can show that our approach identifies causal relationships between genes, providing insights into the underlying gene regulatory programs. Finally, thanks to its modular structure, we can apply the method to perturbation data measured in various modalities, including not only transcriptomics but also imaging. We are very excited about this approach’s potential to enable effective exploration of the perturbation space and to improve our understanding of cellular programs by bridging causal theory and important applications, with implications for both basic research and therapeutics.