Detecting higher-order structural changes in 3D genome organization with multi-task matrix factorization

Abstract

Three-dimensional (3D) genome organization, or how DNA is packaged inside the nucleus, has emerged as a key regulatory mechanism of cellular function and malfunction. High-throughput chromosomal conformation capture (Hi-C) technologies have enabled the study of 3D genome organization by experimentally measuring interactions among genomic regions in 3D space. Analysis of Hi-C data has revealed higher-order organizational units at multiple resolutions: chromosomal territories, compartments, and topologically associating domains (TADs). Changes or disruptions to such structures have been associated with disease, development, and evolution. Therefore, a key problem is to systematically detect higher-order structural changes across Hi-C datasets from multiple conditions. Existing methods for this problem are limited in one or more ways: they identify changes at the level of individual interactions, which makes them sensitive to sparsity and noise; they provide only a coarse summary metric of differences; they specialize in detecting changes at a single structural resolution; and/or they offer only pairwise comparisons. We address these limitations with Tree-structured Graph-regularized Integrated Factorization (TGIF). TGIF is based on Non-negative Matrix Factorization (NMF), a dimensionality reduction method for co-clustering the row and column entities of a matrix. NMF can represent a high-dimensional Hi-C matrix in a low-dimensional space as clusters of genomic regions with similar interaction profiles. TGIF extends an existing multi-task framework by constraining the lower-dimensional factors from closely related tasks to be similar. Additional information about the relationships among the row entities can be encoded in a task-specific graph that constrains the factorization. We demonstrate that our framework effectively recovers ground-truth clusters in simulated data and can detect biologically meaningful structural changes in Hi-C datasets from cancer cell lines and mouse neural development.
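To make the NMF step described in the abstract concrete, the following is a minimal sketch of plain single-task NMF on a toy contact matrix. It is not the TGIF method: it omits the tree-structured multi-task constraint and the graph regularization. The function names (`nmf`, `toy_contact_matrix`), the block sizes, and the contact values are all hypothetical and chosen only for illustration. The solver uses the standard multiplicative updates for the squared-error objective.

```python
import numpy as np

def nmf(X, k, n_iter=1000, seed=0, eps=1e-9):
    """Plain NMF, X ~ U @ V.T, via multiplicative updates (squared error).

    Illustrative only: no multi-task or graph-regularization terms.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, k)) + eps
    V = rng.random((m, k)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative and does not
        # increase the reconstruction error.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

def toy_contact_matrix(block_sizes=(10, 10), within=10.0, between=1.0):
    """Symmetric toy matrix with dense diagonal blocks (hypothetical values)."""
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    same_block = labels[:, None] == labels[None, :]
    X = np.where(same_block, within, between).astype(float)
    return X, labels

X, truth = toy_contact_matrix()
U, V = nmf(X, k=2)

# Normalize factor columns so the cluster call does not depend on the
# arbitrary scaling between U and V, then assign each region (row) to
# the factor with the largest loading.
U_scaled = U / np.linalg.norm(U, axis=0)
clusters = U_scaled.argmax(axis=1)
rel_err = np.linalg.norm(X - U @ V.T) / np.linalg.norm(X)
```

In this sketch the rows of `U` play the role of low-dimensional representations of genomic regions, and regions with similar interaction profiles receive the same cluster label. Cluster labels are arbitrary up to permutation, so they should be compared with `truth` as a partition rather than value by value.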

Date: Nov 23, 2020
Location: Virtual/online
