## Spring 2016 Colloquia

**Friday, January 8: Dr. Dehan Kong, University of North Carolina (Faculty Candidate)**

HCB 103, 10:00am

Title: High-dimensional Matrix Linear Regression Model

Abstract: We develop a high-dimensional matrix linear regression model (HMLRM) to correlate matrix responses with high-dimensional scalar covariates when coefficient matrices have low-rank structures. We propose a fast and efficient screening procedure based on the spectral norm to deal with the case that the dimension of scalar covariates is ultra-high. We develop an efficient estimation procedure based on the nuclear norm regularization, which explicitly borrows the matrix structure of coefficient matrices. We systematically investigate various theoretical properties of our estimators, including estimation consistency, rank consistency, and the sure independence screening property under HMLRM. We examine the finite-sample performance of our methods using simulations and a large-scale imaging genetic dataset collected by the Alzheimer's Disease Neuroimaging Initiative study.
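As a rough illustration of why nuclear norm regularization favors low-rank coefficient matrices, the sketch below applies singular value soft-thresholding, the standard proximal step for a nuclear norm penalty. It is not the speaker's estimation procedure; the function name `svd_soft_threshold`, the threshold `tau`, and the simulated matrices are assumptions made for this example.

```python
import numpy as np

def svd_soft_threshold(M, tau):
    """Proximal operator of tau * (nuclear norm): shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # small singular values become exactly zero
    return (U * s_shrunk) @ Vt           # rebuild the matrix from the shrunken spectrum

# Hypothetical data: a rank-2 matrix plus small noise (illustration only).
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
noisy = low_rank + 0.05 * rng.normal(size=(8, 6))

denoised = svd_soft_threshold(noisy, tau=0.5)
rank_noisy = np.linalg.matrix_rank(noisy)
rank_denoised = np.linalg.matrix_rank(denoised)
print(rank_noisy, rank_denoised)
```

Shrinking the singular values toward zero discards directions that carry only noise, which is how a nuclear norm penalty uses the matrix structure of the coefficients.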

**Tuesday, January 12: Naveen Narisetty, University of Michigan (Faculty Candidate)**

214 Duxbury Hall, 10:00am

Title: Consistent and Scalable Bayesian Model Selection for High Dimensional Data

Abstract: The Bayesian paradigm offers a flexible modeling framework for statistical analysis, but relative to penalization-based methods, little is known about the consistency of Bayesian model selection methods in the high dimensional setting. I will present a new framework for understanding Bayesian model selection consistency, using sample size dependent spike and slab priors that help achieve appropriate shrinkage. More specifically, strong selection consistency is established in the sense that the posterior probability of the true model converges to one even when the number of covariates grows nearly exponentially with the sample size. Furthermore, the posterior on the model space is asymptotically similar to the L0 penalized likelihood. I will also introduce a new Gibbs sampling algorithm for posterior computation, which is much more scalable for high dimensional problems than the standard Gibbs sampler, and yet retains the strong selection consistency property. The new algorithm and the consistency theory work for a variety of problems including linear and logistic regressions, and a more challenging problem of censored quantile regression where a non-convex loss function is involved.

**Friday, January 15: Jonathan Bradley, University of Missouri (Faculty Candidate)**

214 Duxbury Hall, 10:00am

Title: Computationally Efficient Distribution Theory for Bayesian Inference of High-Dimensional Dependent Count-Valued Data

Abstract: We introduce a Bayesian approach for multivariate spatio-temporal prediction for high-dimensional count-valued data. Our primary interest is when there are possibly millions of data points referenced over different variables, geographic regions, and times. This problem requires extensive methodological advancements, as jointly modeling correlated data of this size leads to the so-called "big n problem." The computational complexity of prediction in this setting is further exacerbated by acknowledging that count-valued data are naturally non-Gaussian. Thus, we develop a new computationally efficient distribution theory for this setting. In particular, we introduce a multivariate log-gamma distribution and provide substantial theoretical development including: results regarding conditional distributions, an asymptotic relationship with the multivariate normal distribution, and full-conditional distributions for a Gibbs sampler. To incorporate dependence between variables, regions, and time points, a multivariate spatio-temporal mixed effects model (MSTM) is used. The results in this manuscript are extremely general, and can be used for data that exhibit fewer sources of dependency than what we consider (e.g., multivariate, spatial-only, or spatio-temporal-only data). Hence, the implications of our modeling framework may have a large impact on the general problem of jointly modeling correlated count-valued data. We show the effectiveness of our approach through a simulation study. Additionally, we demonstrate our proposed methodology with an important application analyzing data obtained from the Longitudinal Employer-Household Dynamics (LEHD) program, which is administered by the U.S. Census Bureau.

**Friday, January 22: Abhra Sarkar, Duke University (Faculty Candidate)**

214 Duxbury Hall, 10:00am

Title: Novel Statistical Frameworks for Analysis of Structured Sequential Data

Abstract: We are developing a broad array of novel statistical frameworks for analyzing complex sequential data sets. Our research is primarily motivated by a collaboration with neuroscientists trying to understand the neurological, genetic and evolutionary basis of human communication using bird and mouse models. The data sets comprise structured sequences of syllables or 'songs' produced by animals from different genotypes under different experimental conditions. The primary goal is then to elucidate the roles of different genotypes and experimental conditions on animal vocalization behaviors and capabilities. We have developed novel statistical methods based on first order Markovian dynamics that help answer these important scientific queries. First order dynamics is, however, insufficiently flexible to learn complex serial dependency structures and systematic patterns in the vocalizations, an important secondary goal in these studies. To this end, we have developed a sophisticated nonparametric Bayesian approach to higher order Markov chains building on probabilistic tensor factorization techniques. Our proposed method is of very broad utility, with applications not limited to analysis of animal vocalizations, and provides new insights into the serial dependency structures of many previously analyzed sequential data sets arising from diverse application areas. Our method has appealing theoretical properties and practical advantages, and achieves substantial gains in performance compared to previously existing methods. Our research also paves the way to advanced automated methods for more sophisticated dynamical systems, including higher order hidden Markov models that can accommodate more general data types.

**Tuesday, January 26: Alexander Petersen, University of California, Davis (Faculty Candidate)**

214 Duxbury Hall, 10:00am

Title: Representation of Samples of Density Functions and Regression for Random Objects

Abstract: In the first part of this talk, we will discuss challenges associated with the analysis of samples of one-dimensional density functions. Due to their inherent constraints, densities do not live in a vector space and therefore commonly used Hilbert space based methods of functional data analysis are not appropriate. To address this problem, we introduce a transformation approach, mapping probability densities to a Hilbert space of functions through a continuous and invertible map. Basic methods of functional data analysis, such as the construction of functional modes of variation, functional regression or classification, are then implemented by using representations of the densities in this linear space. Transformations of interest include log quantile density and log hazard transformations, among others. Rates of convergence are derived, taking into account the necessary preprocessing step of density estimation. The proposed methods are illustrated through applications in brain imaging.

The second part of the talk will address the more general problem of analyzing complex data that are non-Euclidean and specifically do not lie in a vector space. To address the need for statistical methods for such data, we introduce the concept of Fréchet regression. This is a general approach to regression when responses are complex random objects in a metric space and predictors are in $\mathbb{R}^p$. We develop generalized versions of both global least squares regression and local weighted least squares smoothing. We derive asymptotic rates of convergence for the corresponding sample based fitted regressions to the population targets under suitable regularity conditions by applying empirical process methods. Illustrative examples include responses that consist of probability distributions and correlation matrices, and we demonstrate the proposed Fréchet regression for demographic and brain imaging data.

**Friday, January 29: Dr. Yifei Sun, Johns Hopkins University (Faculty Candidate)**

214 Duxbury Hall, 10:00am

Title: Recurrent Marker Processes in the Presence of Competing Terminal Events

Abstract: In follow-up studies, utility marker measurements are usually collected upon the occurrence of recurrent events until a terminal event such as death takes place. In this talk, we define the recurrent marker process to characterize utility accumulation over time. For example, with medical cost and repeated hospitalizations being treated as marker and recurrent events respectively, the recurrent marker process is the trajectory of total medical cost spent, which stops increasing after death. In many applications, competing risks arise as subjects are at risk of more than one mutually exclusive terminal event, such as death from different causes, and modeling the recurrent marker process for each failure type is often of interest. However, censoring creates challenges in the methodological development, because for censored subjects, both failure type and recurrent marker process after censoring are unobserved. To circumvent this problem, we propose a nonparametric framework for analyzing this type of data. In the presence of competing risks, we start with an estimator that uses marker information from uncensored subjects only. As a result, the estimator can be inefficient under heavy censoring. To improve efficiency, we propose a second estimator by combining the first estimator with auxiliary information from the estimate under a non-competing-risks model. The large sample properties and optimality of the second estimator are established. Simulation studies and an application to the SEER-Medicare linked data are presented to illustrate the proposed methods.

**Tuesday, February 2: Guan Yu, University of North Carolina at Chapel Hill (Faculty Candidate)**

214 Duxbury Hall, 10:00am

Title: Supervised Learning Incorporating Graphical Structure among Predictors

Abstract: With the abundance of high dimensional data in various disciplines, regularization techniques are very popular these days. Despite the success of these techniques, some challenges remain. One challenge is the development of efficient methods incorporating structure information among predictors. Typically, the structure information among predictors can be modeled by the connectivity of an undirected graph using all predictors as nodes of the graph. In this talk, I will introduce an efficient regularization technique incorporating graphical structure information among predictors. Specifically, according to the undirected graph, we use a latent group lasso penalty to utilize the graph node-by-node. The predictors connected in the graph are encouraged to be selected jointly. This new regularization technique can be used for many supervised learning problems. For sparse regression, our new method using the proposed regularization technique includes adaptive Lasso, group Lasso, and ridge regression as special cases. Theoretical studies show that it enjoys model selection consistency and acquires tight finite sample bounds for estimation and prediction. For the multi-task learning problem, our proposed graph-guided multi-task method includes the popular $\ell_{2,1}$-norm regularized multi-task learning method as a special case. Numerical studies using simulated datasets and the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset also demonstrate the effectiveness of the proposed methods.

**Friday, February 5: Dr. Yun Yang, Duke University (Faculty Candidate)**

214 Duxbury Hall, 10:00am

Title: Computationally efficient high-dimensional variable selection via Bayesian procedures

Abstract: Variable selection is fundamental in many high-dimensional statistical problems with sparsity structures. Much of the literature is based on optimization methods, where penalty terms are incorporated that yield both convex and non-convex optimization problems. In this talk, I will take a Bayesian point of view on high-dimensional regression, by placing a prior on the model space and performing the necessary integration so as to obtain a posterior distribution. In particular, I will show that a Bayesian approach can consistently select all relevant covariates under relatively mild conditions from a frequentist point of view.

Although Bayesian procedures for variable selection are provably effective and easy to implement, it has been suggested by many statisticians that Markov Chain Monte Carlo (MCMC) algorithms for sampling from the posterior distributions may need a long time to converge, as sampling from an exponentially large number of sub-models is an intrinsically hard problem. Surprisingly, our work shows that this plausible "exponentially many model" argument is misleading. By introducing a truncated sparsity prior for variable selection, we provide a set of conditions that guarantee the rapid mixing of a particular Metropolis-Hastings algorithm. The number of iterations for this Markov chain to reach stationarity is linear in the number of covariates up to a logarithmic factor.

**Friday, February 19: Dr. Somnath Datta, University of Florida**

214 Duxbury Hall, 10:00am

Title: Multi-Sample Adjusted U-Statistics that Account for Confounding Covariates

Abstract: Multi-sample U-statistics encompass a wide class of test statistics that allow the comparison of two or more distributions. U-statistics are especially powerful because they can be applied to both numeric and non-numeric (e.g., textual) data. However, when comparing the distribution of a variable across two or more groups, observed differences may be due to confounding covariates. For example, in a case-control study, the distribution of exposure in cases may differ from that in controls entirely because of variables that are related to both exposure and case status and are distributed differently among case and control participants. We propose to use individually reweighted data (using the propensity score for prospective data or the stratification score for retrospective data) to construct adjusted U-statistics that can test the equality of distributions across two (or more) groups in the presence of confounding covariates. Asymptotic normality of our adjusted U-statistics is established and a closed form expression of their asymptotic variance is presented. The utility of our procedures is demonstrated through simulation studies as well as an analysis of genetic data.

**Friday, March 4: Dr. Guang Cheng, Purdue University**

214 Duxbury Hall, 10:00am

Title: How Many Processors Do We Really Need in Parallel Computing?

Abstract: This talk explores the statistical versus computational trade-off to address a basic question in a typical divide-and-conquer setup: what is the minimal computational cost in obtaining statistical optimality? In smoothing spline models, we observe an intriguing phase transition phenomenon for the number of deployed machines, which ends up being a simple proxy for computing cost. Specifically, a sharp upper bound for the number of machines is established: when the number is below this bound, statistical optimality (in terms of nonparametric estimation or testing) is achievable; otherwise, statistical optimality becomes impossible.

**Friday, March 18: Dr. Tianfu Wu, University of California, Los Angeles**

214 Duxbury Hall, 10:00am

Title: Towards a Visual Turing Test and Lifelong Learning: Learning Deep Hierarchical Models and Cost-sensitive Decision Policies for Understanding Visual Big Data

Abstract: Modern technological advances produce data at breathtaking scales and complexities such as the images and videos on the web. Such big data require highly expressive models for their representation, understanding and prediction. To fit such models to the big data, it is essential to develop practical learning methods and fast inferential algorithms. My research has been focused on learning expressive hierarchical models and fast inference algorithms with homogeneous representation and architecture to tackle the underlying complexities in such big data from statistical perspectives. In this talk, with emphasis on a visual restricted Turing test -- the grand challenge in computer vision, I will introduce my work on (i) Statistical Learning of Large Scale and Highly Expressive Hierarchical Models from Big Data, and (ii) Bottom-up/Top-down Inference with Hierarchical Models by Learning Near-Optimal Cost-Sensitive Decision Policies. Applications in object detection, online object tracking and robot autonomy will be discussed.

**Friday, March 25: Dr. Faming Liang, University of Florida**

214 Duxbury Hall, 10:00am

Title: A Blockwise Coordinate Consistent Method for High-Dimensional Parameter Estimation

Abstract: The dramatic improvement in data collection and acquisition technologies in the last decades has enabled scientists to collect a great amount of high dimensional data. Due to their intrinsic nature, many of the high dimensional data, such as omics data and genome-wide association study (GWAS) data, have a much smaller sample size than their dimension (a.k.a. small-n-large-P). How to estimate the parameters for the high dimensional models with a small sample size is still a challenging problem, though substantial progress has been made in the last decades. The popular approach to this problem is regularization, which can perform badly when the sample size is small and the variables are highly correlated. To alleviate this difficulty, we propose a blockwise coordinate consistent (BCC) method, which works by maximizing a new objective function, the expectation of the log-likelihood function, using a cyclic algorithm: iteratively finding consistent estimates for each block of parameters conditional on the current estimates of the other parameters. The BCC method reduces the high dimensional parameter estimation problem to a series of low dimensional parameter estimation problems and is ready to be applied to parameter estimation for the complicated models used in big data analysis. Our numerical results indicate that BCC can provide a drastic improvement in both parameter estimation and variable selection over the regularization methods for high dimensional systems.

**Friday, April 1: Dr. George Michailidis, University of Florida**

214 Duxbury Hall, 10:00am

Title: Estimating high-dimensional multi-layered networks through penalized maximum likelihood

Abstract: Gaussian graphical models represent a good tool for capturing interactions between nodes that represent the underlying random variables. However, in many applications in biology one is interested in modeling associations both between, as well as within molecular compartments (e.g., interactions between genes and proteins/metabolites). To this end, inferring multi-layered network structures from high-dimensional data provides insight into understanding the conditional relationships among nodes within layers, after adjusting for and quantifying the effects of nodes from other layers. We propose an integrated algorithmic approach for estimating multi-layered networks that incorporates a screening step for significant variables, an optimization algorithm for estimating the key model parameters and a stability selection step for selecting the most stable effects. The proposed methodology offers an efficient way of estimating the edges within and across layers iteratively, by solving an optimization problem constructed based on penalized maximum likelihood (under a Gaussianity assumption). The optimization is solved on a reduced parameter space that is identified through screening, which remedies the instability in high dimensions. Theoretical properties are considered to ensure identifiability and consistent estimation of the parameters and convergence of the optimization algorithm, despite the lack of global convexity. The performance of the methodology is illustrated on synthetic data sets and on an application to gene and metabolic expression data for patients with renal disease.

**Friday, April 8: Dr. Qian Zhang, FSU College of Education**

214 Duxbury Hall, 10:00am

Title: A Comparison of Methods for Estimating Moderation Effects with Missing Data in the Predictors

Abstract: The most widely used statistical model for conducting moderation analysis is the moderated multiple regression (MMR) model. While conducting moderation analysis using MMR models, missing data could pose a challenge, mainly because of the nonlinear interaction term. In the study, we consider a simple MMR model, where the effect of predictor *X* on the outcome *Y* is moderated by a moderator *U*. The primary interest is to find ways of estimating and testing the moderation effect with the existence of missing data in the predictor *X*. We mainly focus on cases when *X* is missing completely at random and missing at random. Theoretically, it is found in the study that the existing methods including normal-distribution-based maximum likelihood estimation (NML) and normal-distribution-based multiple imputation (NMI) yield inconsistent moderation effect estimates when data are missing at random. To cope with this issue, Bayesian estimation (BE) is proposed. To compare the existing methods and the proposed BE under finite sample sizes, a simulation study is also conducted. Results indicate that the methods in comparison have different relative performance depending on various factors. The factors are missing data mechanisms, roles of variables responsible for missingness, population moderation effect sizes, sample sizes, missing data proportions, and distributions of predictor *X*. Limitations of the study and future research directions are also discussed.

**Friday, April 15: Wei Sun, Yahoo Research**

214 Duxbury Hall, 10:00am

Title: Provable Sparse Tensor Decomposition and Its Application to Personalized Recommendation

Abstract: Tensors, as a multi-dimensional generalization of matrices, have received increasing attention in industry due to their success in personalized recommendation systems. Traditional recommendation systems are mainly based on the user-item matrix, whose entry denotes each user's preference for a particular item. To incorporate additional information into the analysis, such as the temporal behavior of users, we encounter a user-item-time tensor. Existing tensor decomposition methods for personalized recommendation are mostly established in the non-sparse regime where the decomposition components include all features. For high dimensional tensor-valued data, many features in the components essentially contain no information about the tensor structure, and thus there is a great need for a more appropriate method that can simultaneously perform tensor decomposition and select informative features.

In this talk, I will discuss a new sparse tensor decomposition method that incorporates the sparsity of each decomposition component into the CP tensor decomposition. Specifically, the sparsity is achieved via an efficient truncation procedure to directly solve an L0 sparsity constraint. In theory, in spite of the non-convexity of the optimization problem, it is proven that an alternating updating algorithm attains an estimator whose rate of convergence significantly improves on those shown in non-sparse decomposition methods. As a by-product, our method is also widely applicable to solve a broad family of high dimensional latent variable models, including high dimensional Gaussian mixtures and mixtures of sparse regression. I will show the advantages of our method in two real applications, click-through rate prediction for online advertising and high dimensional gene clustering.
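A truncation step for an L0 constraint can be sketched as keeping only the largest-magnitude entries of a component vector. The snippet below is a minimal illustration under that assumption, not the speaker's algorithm; the helper `truncate_top_k`, the renormalization to unit length, and the example vector are hypothetical.

```python
import numpy as np

def truncate_top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest, then renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]  # indices of the k largest |v_i|
    out[keep] = v[keep]
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out

# Hypothetical dense component; only two entries carry real signal.
component = np.array([0.9, -0.05, 0.4, 0.02, -0.1])
sparse_component = truncate_top_k(component, k=2)
print(sparse_component)
```

In an alternating scheme, a step of this kind would follow each component update, so that every iterate satisfies the sparsity constraint exactly.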

**Friday, April 22: Dr. Kun Chen, University of Connecticut**

214 Duxbury Hall, 10:00am

Title: Sequential Estimation in Sparse Factor Regression

Abstract: Multivariate regression models of large scales are increasingly required and formulated in various fields. A sparse singular value decomposition of the regression component matrix is appealing for achieving dimension reduction and facilitating model interpretation. However, how to recover such a composition of sparse and low-rank structures remains a challenging problem. By exploring the connections between factor analysis and reduced-rank regression, we formulate the problem as a sparse factor regression and develop an efficient sequential estimation procedure. At each sequential step, a latent factor is constructed as a sparse linear combination of the observed predictors, for predicting the responses after accounting for the effects of the previously found latent factors. Compared to the complicated joint estimation approach, a prominent feature of our proposed sequential method is that each step reduces to a simple regularized unit-rank regression, in which the orthogonality requirement among the sparse factors becomes optional rather than necessary. The ideas of coordinate descent and Bregman iterative methods are utilized to ensure fast computation and algorithmic convergence, even in the presence of missing data and when exact orthogonality is desired. Theoretically, we show that the sequential estimators enjoy the oracle properties for recovering the underlying sparse factor structure. The efficacy of the proposed approach is demonstrated by simulation studies and two real applications in genetics.