Fall 2018 Colloquia
214 Duxbury Hall, 10am
Title: Covariate-Adjusted Tensor Classification and Other Applications in High Dimensions
Abstract: In contemporary scientific research, it is often of great interest to predict a categorical response based on a high-dimensional tensor (i.e., a multi-dimensional array) and additional covariates. Motivated by applications in science and engineering, we propose a comprehensive and interpretable discriminant analysis model, called the CATCH model (short for Covariate-Adjusted Tensor Classification in High-dimensions). The CATCH model efficiently integrates the covariates and the tensor to predict the categorical outcome. The tensor structure is utilized to achieve easy interpretation and accurate prediction. To tackle the new computational and statistical challenges arising from the intimidating tensor dimensions, we propose a penalized approach to select a subset of the tensor predictor entries that affect classification after adjustment for the covariates. An efficient algorithm is developed to take advantage of the tensor structure in the penalized estimation. Theoretical results confirm that the proposed method achieves variable selection and prediction consistency, even when the tensor dimension is much larger than the sample size. The superior performance of our method over existing methods is demonstrated in extensive simulated and real data examples. We further note that the proposed algorithm in the CATCH model has applications beyond tensor classification. We investigate its performance in differential networks and quadratic discriminant analysis. Compared with state-of-the-art methods, our algorithm has significantly lower computational cost when the true model is highly sparse.
214 Duxbury Hall, 10am
Title: Mixture of Regression Models for Large Spatial Data Sets
Abstract: When a spatial regression model that links a response variable to a set of explanatory variables is desired, it is unlikely that the same regression model holds throughout the domain when the spatial domain and dataset are both large and complex. The locations where the trend changes may not be known, and we present here a mixture of regression models approach to identifying the locations wherein the relationship between the predictors and the response is similar; to estimating the model within each group; and to estimating the number of groups. An EM algorithm for estimating this model is presented along with a criterion for choosing the number of groups. Performance of the estimators and model selection is demonstrated through simulation. An example with groundwater depth and associated predictors generated from a large physical model simulation demonstrates the fit and interpretation of the proposed model.
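For readers unfamiliar with the idea, the following is a minimal sketch of a generic EM algorithm for a finite mixture of linear regressions with Gaussian errors. It is not the speaker's method: it ignores spatial structure entirely, fixes the number of groups in advance, and does not implement the criterion for choosing the number of groups mentioned in the abstract. The function name `em_mixture_of_regressions`, its parameters, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def em_mixture_of_regressions(X, y, n_groups=2, n_iter=50, seed=0):
    """Generic EM for a mixture of linear regressions (illustrative only).

    Assumes Gaussian errors with a separate variance per group and a fixed,
    user-supplied number of groups. No spatial information is used.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Start from random soft assignments; each row sums to one.
    resp = rng.dirichlet(np.ones(n_groups), size=n)
    betas = np.zeros((n_groups, p))
    sigma2 = np.ones(n_groups)
    weights = np.full(n_groups, 1.0 / n_groups)
    loglik_trace = []

    for _ in range(n_iter):
        # M-step: weighted least squares within each group.
        for k in range(n_groups):
            w = resp[:, k]
            Xw = X * w[:, None]
            # Tiny ridge term keeps the linear solve numerically stable.
            betas[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ y)
            resid_k = y - X @ betas[k]
            sigma2[k] = max((w * resid_k**2).sum() / w.sum(), 1e-8)
            weights[k] = w.mean()

        # E-step: posterior group probabilities, computed with log-sum-exp.
        resid = y[:, None] - X @ betas.T  # shape (n, n_groups)
        log_dens = (
            np.log(weights)
            - 0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * resid**2 / sigma2
        )
        m = log_dens.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_dens - m).sum(axis=1, keepdims=True))
        resp = np.exp(log_dens - log_norm)
        # Observed-data log-likelihood at the current parameter values.
        loglik_trace.append(float(log_norm.sum()))

    return betas, sigma2, weights, resp, loglik_trace


# Simulated example (hypothetical data): two groups with opposite slopes.
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
group = rng.integers(0, 2, size=n)
true_betas = np.array([[0.0, 3.0], [0.0, -3.0]])
y = (X * true_betas[group]).sum(axis=1) + rng.normal(scale=0.3, size=n)

betas, sigma2, weights, resp, trace = em_mixture_of_regressions(X, y)
```

Each row of `resp` gives the estimated probability that an observation belongs to each group, and `trace` records the log-likelihood after every iteration, which should not decrease. EM can converge to a local optimum, so in practice several random starts are usually compared.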
Title: Regression Analysis of Longitudinal Data With Omitted Asynchronous Longitudinal Covariate
Abstract: Long-term follow-up with longitudinal data is common in many medical investigations. In such studies, a longitudinal covariate can be omitted for various reasons. A naïve approach that simply ignores the omitted longitudinal covariate can lead to biased estimators. In this article, we propose new unbiased estimation methods to accommodate an omitted longitudinal covariate. In addition, if the omitted longitudinal covariate is asynchronous with the longitudinal response, a two-stage approach is proposed for valid statistical inference. Asymptotic properties of the proposed estimators are established. Extensive simulation studies provide numerical support for the theoretical findings. We illustrate the performance of our method on a dataset from an HIV study.
Title: Data Science and Biostatistics: The Power of Thinking Differently Through Team Science
Abstract: The development of artificial intelligence solutions to help advance the practice of health care could be considered the next space race. In this presentation, the journey from traditional biostatistics to the new world of artificial intelligence will be explored through a series of motivating examples that emphasize the role of team science and thinking inside the black box. The talk will highlight the role of convolutional neural networks and deep learning for data that may not be typically envisioned as an image. The presentation will conclude with information on new resources that are being developed to provide cross training opportunities to help raise awareness of machine learning techniques within the biostatistical community.
Title: Increasingly Powerful Tornadoes
Abstract: Storm reports show an upward trend in the power of tornadoes from longer and wider paths and higher damage ratings. Quantifying the magnitude of the increase is difficult given diurnal and seasonal influences on tornadoes embedded within natural variations and made worse by changes in how damage is rated. Here we solve this problem by fitting a statistical model to a metric of power during the period 1994-2016. We find an increase of 5.5% per year (95% CI: 4.6%, 6.5%) in tornado power controlling for the diurnal cycle, seasonality, natural climate variability, and the switch to a new damage scale. We find that a portion of the trend is attributable to rising ocean temperatures across the Gulf of Mexico and western Caribbean Sea. Results support the hypothesis that with added instability from more heat and moisture in a warming world tornadoes are becoming more powerful.
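To give a sense of scale for a 5.5% annual increase, the short calculation below compounds the reported rate over the 1994-2016 study period. This is only illustrative arithmetic: it assumes the annual increase acts multiplicatively from year to year, which the abstract does not state, and compounding the interval endpoints is not a formal confidence interval for the cumulative change. The function name `cumulative_factor` is a hypothetical helper.

```python
def cumulative_factor(annual_rate, n_years):
    """Total multiplicative change after compounding annual_rate for n_years."""
    return (1.0 + annual_rate) ** n_years


# Figures taken from the abstract.
annual_rate = 0.055           # reported increase per year
ci_low, ci_high = 0.046, 0.065  # reported 95% CI endpoints
n_years = 2016 - 1994         # 22 yearly steps across the study period

# Assumption: the rate compounds multiplicatively each year.
point = cumulative_factor(annual_rate, n_years)
low = cumulative_factor(ci_low, n_years)
high = cumulative_factor(ci_high, n_years)

print(f"Cumulative factor over {n_years} years: {point:.2f} "
      f"(endpoints compounded: {low:.2f} to {high:.2f})")
```

Under this compounding assumption, the point estimate corresponds to tornado power roughly tripling over the study period.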
Title: Robust PCA by Manifold Optimization
Abstract: Robust PCA is a widely used statistical procedure to recover an underlying low-rank matrix from grossly corrupted observations. This work considers the problem of robust PCA as a nonconvex optimization problem on the manifold of low-rank matrices and proposes two algorithms based on manifold optimization. It is shown that, with a properly designed initialization, the proposed algorithms are guaranteed to converge to the underlying low-rank matrix linearly. Compared with a previous work based on the factorization of low-rank matrices, the proposed algorithms reduce the dependence on the condition number of the underlying low-rank matrix theoretically. Simulations and real data examples confirm the competitive performance of our method.
Title: Multiple Breakpoint Detection: Mixing Documented and Undocumented Changepoints
Abstract: This talk presents methods to estimate the number of changepoints and their locations in time-ordered data sequences when prior information is known about some of the changepoint times. A Bayesian version of a penalized likelihood objective function is developed from minimum description length (MDL) information theory principles. Optimizing the objective function yields estimates of the number of changepoints and their location time(s). Our MDL penalty depends on where the changepoint(s) lie, and not solely on the total number of changepoints (as classical AIC and BIC penalties do). Specifically, configurations with changepoints that occur relatively close to one another are penalized more heavily than sparsely arranged changepoints. The techniques allow for autocorrelation in the observations and mean shifts at each changepoint time. This scenario arises in climate time series where a "metadata" record exists documenting some, but not necessarily all, of station move times and instrumentation changes. Applications to climate time series are presented throughout.
Title: Interpolating Distributions for Populations in Nested Geographies Using Public-use Data with Application to the American Community Survey
Abstract: Statistical agencies often publish multiple data products from the same survey. First, they produce aggregate estimates of various features of the distributions of several socio-demographic quantities of interest. Often these area-level estimates are tabulated at small geographies. Second, statistical agencies frequently produce weighted public-use microdata samples (PUMS) that provide detailed information on the entire distribution for the same socio-demographic variables. However, the public-use micro areas usually constitute relatively large geographies in order to protect against the identification of households or individuals included in the sample. These two data products represent a trade-off in official statistics: publicly available data products can either provide detailed spatial information or detailed distributional information, but not both. We propose a model-based method to combine these two data products to produce estimates of detailed features of a given variable at a high degree of spatial resolution. Our motivating example uses the disseminated tabulations and PUMS from the American Community Survey to estimate U.S. Census tract-level income distributions and statistics associated with these distributions. Joint work with Matthew Simpson, Christopher K. Wikle, and Jonathan R. Bradley.
Title: Spatially Informed Variable Selection Priors with Applications to Neuroimaging Data
Abstract: There is a huge literature on Bayesian methods for variable selection that use spike-and-slab priors. Such methods, in particular, have been quite successful for applications in a variety of different fields. High-throughput genomics and neuroimaging are two such examples. There, novel methodological questions are being generated, requiring the integration of different concepts, methods, tools and data types. These have in particular motivated the development of variable selection priors that, for example, go beyond the independence assumptions of a simple Bernoulli prior on the inclusion indicators. In this talk I will review various prior constructions that incorporate information about structural dependencies among the variables. I will look in particular at models for neuroimaging applications, where specific structural information is incorporated into the prior probability models. I will also present models that incorporate information on connectivity among brain regions.
Title: Envelope in Discriminant Analysis and Model-Based Clustering
Abstract: In this talk, I will give an overview of envelopes and discuss their applications to various supervised and unsupervised learning problems. In particular, we will discuss how to improve efficiency in classification of vector and tensor data, how to jointly model multiple precision and covariance matrices, and how to identify latent clusters and heterogeneity in unlabeled data.