## Spring 2020 Colloquia

**Previous Colloquia:**

**Friday, March 6th: Dr. Ziwei Zhu (University of Michigan)**

214 Duxbury Hall, 10:00am

Title: High-Dimensional Principal Component Analysis with Heterogeneous Missingness

Abstract: In this talk, I will focus on the effect of missing data in Principal Component Analysis (PCA). In simple, homogeneous missingness settings with a noise level of constant order, we show that an existing inverse-probability weighted (IPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, and discover a new phase transition phenomenon along the way. However, deeper investigation reveals both that, particularly in more realistic settings where the missingness mechanism is heterogeneous, the empirical performance of the IPW estimator can be unsatisfactory, and moreover that, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method for high-dimensional PCA, called "primePCA", that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the IPW estimator, "primePCA" iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. It turns out that the interaction between the heterogeneity of missingness and the low-dimensional structure is crucial in determining the feasibility of the problem. We therefore introduce an incoherence condition on the principal components and prove that in the noiseless case, the error of "primePCA" converges to zero at a geometric rate when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that "primePCA" exhibits very encouraging performance across a wide range of scenarios.

**Friday, February 21st: Dr. Aaron Molstad, University of Florida**

214 Duxbury Hall, 10:00am

Title: Multivariate Mixed-Type Response Regression

Abstract: We study the multivariate square-root lasso, a method for fitting the multivariate response (i.e., multi-task) linear regression model with dependent errors. This estimator minimizes the nuclear norm of the residual matrix plus a convex penalty. Unlike some existing methods for multivariate response linear regression, which require explicit estimates of the error covariance matrix or its inverse, the multivariate square-root lasso criterion implicitly adapts to dependent errors and is convex. To justify the use of this estimator, we establish an error bound which illustrates that, like the univariate square-root lasso, the multivariate square-root lasso is pivotal with respect to the unknown error covariance matrix. Based on our theory, we propose a simple tuning approach which requires fitting the model for only a single value of the tuning parameter, i.e., does not require cross-validation. We propose two algorithms to compute the estimator: a prox-linear alternating direction method of multipliers algorithm, and an accelerated first order algorithm which can be applied in certain cases. In both simulation studies and a genomic data application, we show that the multivariate square-root lasso can outperform more computationally intensive methods which estimate both the regression coefficient matrix and error precision matrix.

**Wednesday, January 8th: Fangzheng Xie, Johns Hopkins University**

499 Dirac Science Library, 10:00am

Title: Global and Local Estimation of Low-Rank Random Graphs

Abstract: Random graph models have been a hot topic in statistics and machine learning, as well as in a broad range of application areas. In this talk I will give two perspectives on the estimation task of low-rank random graphs. Specifically, I will focus on estimating the latent positions in random dot product graphs. The first component of the talk focuses on the global estimation task. The minimax lower bound for global estimation of the latent positions is established, and this minimax lower bound is achieved by a Bayes procedure, referred to as the posterior spectral embedding. The second component of the talk addresses the local estimation task. We define local efficiency in estimating each individual latent position, propose a novel one-step estimator that takes advantage of the curvature information of the likelihood function (i.e., derivative information) of the graph model, and show that this estimator is locally efficient. The previously widely adopted adjacency spectral embedding is proven to be locally inefficient because it ignores the curvature information of the likelihood function. Simulation examples and the analysis of a real-world Wikipedia graph dataset are provided to demonstrate the usefulness of the proposed methods.

**Friday, January 10th: Jiwei Zhao, State University of New York at Buffalo**

214 Duxbury Hall, 10:00am

Title: A Unified Statistical Machine Learning Framework for the Minimal Clinically Important Difference

Abstract: The minimal clinically important difference (MCID), defined as the smallest change in a treatment outcome that an individual patient would identify as important and which would indicate a change in the patient’s management, has been a fundamentally critical concept in personalized medicine and population health for decades. However, most existing methods of determining the MCID are ad hoc and cannot incorporate the covariate information that has grown dramatically with the use of electronic health records. In this talk, I will present a principled, unified statistical machine learning framework for estimating the MCID at the population level and at the individual level. For the individual level, which incorporates the covariate factors, we consider the traditional low-dimensional case as well as the practical high-dimensional case. In particular, for the high-dimensional case, I will present a path-following iterative algorithm and some exciting theoretical results with a nonstandard convergence rate. We conduct comprehensive simulation studies to reinforce these theoretical findings and also apply our method to the study of chondral lesions in knee surgery to demonstrate the usefulness of the proposed approach.

**Monday, January 13th: Yuqi Gu, University of Michigan**

499 Dirac Science Library, 10:00am

Title: Uncover Hidden Fine-Grained Scientific Information: Structured Latent Attribute Models

Abstract: In modern psychological and biomedical research with diagnostic purposes, scientists often formulate the key task as inferring the fine-grained latent information under structural constraints. These structural constraints usually come from the domain experts’ prior knowledge or insight. The emerging family of Structured Latent Attribute Models (SLAMs) accommodates these modeling needs and has received substantial attention in psychology, education, and epidemiology. SLAMs bring exciting opportunities and unique challenges. In particular, with high-dimensional discrete latent attributes and structural constraints encoded by a design matrix, one needs to balance the gain in the model’s explanatory power and interpretability against the difficulty of understanding and handling the complex model structure.

In the first part of this talk, I present identifiability results that advance the theoretical knowledge of how the design matrix influences the estimability of SLAMs. The new identifiability conditions guide real-world practices of designing diagnostic tests and also lay the foundation for drawing valid statistical conclusions. In the second part, I introduce a statistically consistent penalized likelihood approach to selecting significant latent patterns in the population. I also propose a scalable computational method. These developments explore an exponentially large model space involving many discrete latent variables, and they address the estimation and computation challenges of high-dimensional SLAMs arising from large-scale scientific measurements. The application of the proposed methodology to the data from an international educational assessment reveals a meaningful knowledge structure of the student population.

**Wednesday, January 15th: Rongjie Liu, Rice University**

499 Dirac Science Library, 10:00am

Title: CARP: Compression through Adaptive Recursive Partitioning for Multi-Dimensional Images

Abstract: Multi-dimensional images are a fundamental data type that arises in many areas such as neuroscience, engineering, structural biology, and medicine. Fast and effective image compression for multi-dimensional images has become increasingly important for efficient storage and transfer of massive amounts of high-resolution images and videos. We propose a hierarchical Bayesian method called CARP that achieves the following desirable properties in compression methods: (1) high reconstruction quality at a wide range of compression rates while preserving key local details, (2) linear computational scalability, (3) applicability to various image/video types and of different dimensions, and (4) ease of tuning. In particular, we infer an optimal permutation of the image pixels from a Bayesian probabilistic model on recursive partitions of the image to reduce its effective dimensionality, leading to a parsimonious representation that preserves information. The multi-layer Bayesian hierarchical model in use enables self-tuning and regularization to avoid overfitting, resulting in one single parameter to be specified by the user to achieve the desired compression rate. Extensive numerical experiments using a variety of datasets show that CARP dominates the state-of-the-art compression approaches, including JPEG, JPEG2000, MPEG4, and a neural network-based method, for all selected image types and often on nearly all of the individual images.

**Friday, January 17th: Jonathan Stewart, Rice University**

214 Duxbury Hall, 10:00am

Title: A Probabilistic Framework for Models of Dependent Network Data, With Statistical Guarantees

Abstract: The statistical analysis of network data has attracted considerable attention since the turn of the twenty-first century, fueled by the rise of the internet and social networks and applications in public health (e.g., the spread of infectious diseases through contact networks), national security (e.g., networks of terrorists and cyberterrorists), economics (e.g., networks of financial transactions), and more. While substantial progress has been made on exchangeable random graph models and random graph models with latent structure (e.g., stochastic block models and latent space models), these models make explicit or implicit independence or weak dependence assumptions that may not be satisfied by real-world networks, because network data are dependent data. How to construct models of random graphs with dependent edges without sacrificing computational scalability and statistical guarantees is an important question that has received scant attention.

In this talk, I present recent advancements in models, methods, and theory for modeling networks with dependent edges. On the modeling side, I introduce a novel probabilistic framework for specifying edge interactions that allows dependence to propagate throughout the population graph, with applications to brokerage in social networks. On the statistical side, I obtain the first consistency results in settings where dependence propagates throughout the population graph and the number of parameters increases with the number of population members. The key to my approach lies in establishing a direct link between the convergence rate of maximum likelihood estimators and scaling of the Fisher information matrix. Last, but not least, on the computational side I demonstrate how the conditional independence structure of models can be exploited for local computing on subgraphs, which facilitates parallel computing on multi-core computers or computing clusters.

499 Dirac Science Library, 10:00am

Title: On the Trilogy of Nonparametric Methods: Models, Inference, and Misspecification

Abstract: Nonparametric methods provide a flexible framework to understand the relationship between the input and output of complex systems. Despite the successful application of nonparametric methods, many of them do not have a clear characterization of their behavior from a theoretical point of view. In the first part of my talk, I will introduce our recent work on characterizing the upper and lower error bounds of the kriging (Gaussian process regression) predictor under a uniform metric and an L_p metric. The kriging predictor can be misspecified. We show that if the design is quasi-uniform and an oversmoothed correlation function is used, the optimal rate can be achieved. I will also introduce an application of our results to Bayesian optimization.

In the second part, I will introduce a multi-resolution functional ANOVA model which can work for large-scale and many-input nonlinear regression problems. This model provides an alternative way to build emulators which are useful in computer experiments. New results on consistency and inference for the overlapping group Lasso in a high-dimensional setting are presented and applied to our multi-resolution functional ANOVA model.

**Friday, January 24th: Nicholas Syring, Washington University in St. Louis**

214 Duxbury Hall, 10:00am

Title: Gibbs Posterior Distributions

Abstract: Bayesian methods provide a standard framework for statistical inference in which prior beliefs about a population under study are combined with evidence provided by data to produce revised posterior beliefs. As with all likelihood-based methods, Bayesian methods may present drawbacks stemming from model misspecification and over-parametrization. A generalization of Bayesian posteriors, called Gibbs posteriors, links the data and population parameters of interest via a loss function rather than a likelihood, thereby avoiding these potential difficulties. At the same time, Gibbs posteriors retain the prior-to-posterior updating of beliefs. We will illustrate the advantages of Gibbs methods in examples and highlight newly developed strategies to analyze the large-sample properties of Gibbs posterior distributions.

**Monday, January 27th: Ray Bai, University of Pennsylvania**

499 Dirac Science Library, 10:00am

Title: Fast Algorithms and Theory for High-Dimensional Bayesian Varying Coefficient Models

Abstract: Nonparametric varying coefficient (NVC) models are widely used for modeling time-varying effects on responses that are measured repeatedly. In this talk, we introduce the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL) for Bayesian estimation and variable selection in NVC models. The NVC-SSL simultaneously selects and estimates the functionals of the significant time-varying covariates, while also accounting for temporal correlations. Our model can be implemented using a highly efficient expectation-maximization (EM) algorithm, thus avoiding the computational intensiveness of Markov chain Monte Carlo (MCMC) in high dimensions. In contrast to frequentist NVC models, hardly anything is known about the large-sample properties of Bayesian NVC models. In this talk, we take a step towards addressing this longstanding gap between methodology and theory by deriving posterior contraction rates under the NVC-SSL model when the number of covariates grows at a nearly exponential rate with sample size. Finally, we introduce a simple approach to make our method robust to misspecification of the temporal correlation structure. We illustrate our methodology through simulation studies and data analysis.

**Friday, January 31st: Hyebin Song, University of Wisconsin**

214 Duxbury Hall, 10:00am

Title: Statistical Inference for Large-Scale Data with Incomplete Labels

Abstract: In various real-world problems, we are presented with data with partially observed or contaminated labels. One example is datasets from deep mutational scanning (DMS) experiments in proteomics, which typically do not contain non-functional sequences. In many of these settings, the problem of interest is high-dimensional, where the number of features is substantially larger than the sample size. The combination of contamination in the labels and high-dimensionality presents both statistical and computational challenges. In this talk, I will present new statistical inference procedures for analyzing noisy, high-dimensional binary data. With the key observation that the noisy labels problem belongs to a special sub-class of generalized linear models, I will first describe convex and non-convex approaches, based on the method of moments and the likelihood function. I demonstrate that the parameter estimates from both approaches achieve the minimax optimal mean-squared error rates, and describe a valid testing procedure based on de-biasing the parameter estimates from both approaches. I will then discuss scalable algorithms for parameter estimation using the two approaches with large-scale data, and compare the two approaches based on theoretical and empirical studies. Finally, I will present an application of our methodology to inferring sequence-function relationships and designing highly stabilized enzymes from large-scale DMS data.

**Monday, February 3rd: Tianjian Zhou, University of Chicago**

499 Dirac Science Library, 10:00am

Title: Inferring Latent Tumor Cell Subpopulations from Bulk DNA Sequencing Data

Abstract: A tumor cell population consists of genetically heterogeneous subpopulations (subclones), with each subpopulation being characterized by overlapping sets of single nucleotide variants (SNVs). Bulk sequencing data using high-throughput sequencing technology provide short reads mapped to many nucleotide loci as a mixture of signals from different subclones. Based on such data, we infer tumor subclones using generalized latent factor models. Specifically, we estimate the number of subclones, their genotypes, population frequencies, and the phylogenetic tree spanned by the inferred subclones. Prior probabilities are assigned to these latent quantities, and posterior inference is implemented through Markov chain Monte Carlo simulations. A key innovation in our method, TreeClone, is to model short reads mapped to pairs of proximal SNVs, which we refer to as mutation pairs. The performance of our method is assessed using simulated and real datasets with single and multiple tumor samples.