## Spring 2018 Colloquia

**Upcoming Colloquia:**

- April 20th - Xing Qiu (University of Rochester Medical Center)

Tuesday, January 9th: Dr. Hongyuan Cao, University of Missouri

204 Duxbury Hall, 2:00pm

Title: Multiple testing meets isotonic regression for high dimensional integrative analysis

Abstract: Large-scale multiple testing is a fundamental problem in statistical inference. Recent technological advances make available various types of auxiliary information, such as prior data and external covariates. Our goal is to utilize such auxiliary information to improve statistical power in multiple testing. This is formally achieved through a shape-constrained relationship between the auxiliary information and the prior probabilities of being null. We propose to estimate the unknown parameters by embedding isotonic regression in the EM algorithm. The resulting testing procedure utilizes the ordering of the auxiliary information and is thus very robust. We show, both empirically and theoretically, that the proposed method leads to a large power increase while controlling the false discovery rate. Extensive simulations demonstrate the advantage of the proposed method over several state-of-the-art methods. The new methodology is illustrated with a data set from genome-wide association studies.
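The shape-constrained idea in the abstract can be illustrated with a toy sketch: a two-group model for p-values whose null probability is assumed monotone in an auxiliary covariate, fit by alternating an E-step with an isotonic (pool-adjacent-violators) M-step. Everything here, the Beta(0.3, 1) alternative density, the hand-rolled PAVA, and the simulated data, is an illustrative assumption, not the speaker's actual procedure.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: unweighted isotonic (non-decreasing) fit."""
    vals = [float(v) for v in y]
    wts = [1.0] * len(vals)
    blocks = [[i, i] for i in range(len(vals))]
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:                     # violation: pool the two blocks
            tot = wts[i] + wts[i + 1]
            vals[i] = (wts[i] * vals[i] + wts[i + 1] * vals[i + 1]) / tot
            wts[i] = tot
            blocks[i][1] = blocks[i + 1][1]
            del vals[i + 1], wts[i + 1], blocks[i + 1]
            if i > 0:
                i -= 1
        else:
            i += 1
    out = np.empty(len(y))
    for (a, b), v in zip(blocks, vals):
        out[a:b + 1] = v
    return out

def em_isotonic(pvals, a=0.3, n_iter=100):
    """EM for p ~ pi0(u)*U(0,1) + (1 - pi0(u))*Beta(a, 1),
    with pi0 constrained non-decreasing in the covariate order."""
    f1 = a * pvals ** (a - 1.0)                        # alternative density at p
    pi0 = np.full(len(pvals), 0.5)
    for _ in range(n_iter):
        gamma = pi0 / (pi0 + (1.0 - pi0) * f1)         # E-step: posterior null prob
        pi0 = np.clip(pava(gamma), 1e-6, 1 - 1e-6)     # M-step: isotonic fit in u
    return pi0

rng = np.random.default_rng(0)
n = 2000
u = np.linspace(0.0, 1.0, n)                           # auxiliary covariate, sorted
pi0_true = 0.2 + 0.7 * u                               # null probability grows with u
is_null = rng.random(n) < pi0_true
pvals = np.where(is_null, rng.random(n), rng.beta(0.3, 1.0, n))
pi0_hat = em_isotonic(pvals)
```

The isotonic M-step is what makes the procedure depend only on the ordering of the covariate, which is the source of the robustness claimed in the abstract.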

Friday, January 12th: Dr. Anderson Ye Zhang, Yale University

214 Duxbury Hall, 10:00am

Title: Computational and Statistical Guarantees of Mean Field Variational Inference

Abstract: Mean field variational inference is widely used in statistics and machine learning to approximate posterior distributions. Despite its popularity, there exists remarkably little fundamental theoretical justification for it. To the best of our knowledge, the iterative algorithm has never been investigated for any high-dimensional or complex model. In this talk we attempt to establish computational and statistical guarantees for mean field variational inference. For the community detection problem, we show that the iterative algorithm converges linearly to the optimal statistical accuracy within log n iterations. In addition, the technique we develop can be extended to analyzing the expectation-maximization algorithm and the Gibbs sampler.
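As a minimal illustration of the kind of iterative algorithm under study, here is textbook coordinate-ascent mean field variational inference for a bivariate Gaussian target (the classic example from Bishop's *Pattern Recognition and Machine Learning*, ch. 10): each factor update is available in closed form and the iterates converge linearly to the true mean. The community-detection analysis in the talk is of course far more involved; this is only the simplest instance of the scheme.

```python
import numpy as np

# Target: bivariate normal with mean mu and covariance Sigma.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.2], [1.2, 1.0]])
Lam = np.linalg.inv(Sigma)                 # precision matrix

# Mean-field approximation q(x) = q1(x1) q2(x2). The CAVI update for
# each Gaussian factor is m_i = mu_i - Lam_ij * (m_j - mu_j) / Lam_ii,
# with variance 1 / Lam_ii fixed throughout.
m = np.zeros(2)
for _ in range(50):
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]
```

Each sweep contracts the error in the mean by a fixed factor (here roughly 0.72, determined by the correlation), which is exactly the linear-convergence behavior the abstract refers to.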

Friday, January 19th: Dr. Steffen Ventz, University of Rhode Island

214 Duxbury Hall, 10:00am

Title: Bayesian Uncertainty-Directed Designs for Clinical Trials

Abstract: Most Bayesian response-adaptive designs unbalance randomization rates towards the most promising arms with the goal of increasing the number of positive treatment outcomes during the study, even though the primary aim of the trial is different. In this talk I discuss Bayesian uncertainty-directed (BUD) designs, a class of Bayesian designs in which the investigator specifies an information measure tailored to the clinical experiment. All decisions during the trial are selected to optimize the available information at the end of the study. The approach can be applied to several designs, ranging from early-stage dose-finding trial designs to biomarker-driven and multi-endpoint trials. I discuss the asymptotic limit of the patient allocation proportion to treatments. The finite-sample operating characteristics of BUD designs are illustrated through several examples, including a dose-finding design with biomarker measurements that allow dose optimization at the individual level, biomarker-stratified trials, and trials with multiple co-primary endpoints.

Tuesday, January 23rd: Dr. David Jones, Duke University

214 Duxbury Hall, 2:00pm

Title: Designing Test Information and Test Information in Design

Abstract: Lindley (1956) and DeGroot (1962) developed a general framework for constructing Bayesian measures of the expected information that an experiment will provide for estimation. We propose an analogous framework for Bayesian measures of information for decision problems, such as hypothesis testing, classification, or model selection. We demonstrate how these "test information" measures can be used in experimental design, and show that the resulting designs differ from designs based on estimation information, e.g., Fisher information or the information proposed by Lindley (1956) (which results in D-optimal designs). The underlying intuition of our design proposals is straightforward: to distinguish between two or more models we should collect data from regions of the covariate space for which the models differ most. We identify a fundamental coherence identity which, when satisfied, ensures that the optimal design does not depend on which hypothesis is true. We additionally establish an asymptotic equivalence between our test information measures and Fisher information, thereby offering insight for contexts where both testing and estimation are of interest. Lastly, we illustrate our design ideas by applying them to a classification problem in astronomy. The task is to schedule future observation times in order to best classify stars based on time series of their brightness.

Friday, January 26th: Dr. Cyrus DiCiccio, Stanford University

214 Duxbury Hall, 10:00am

Title: Improving Efficiency of Hypothesis Tests via Data Splitting

Abstract: Data splitting, a tool that is well studied and commonly used for estimation problems such as assessing prediction error, can also be useful in testing problems: a portion of the data can be allocated to make the testing problem easier in some sense, say by estimating or even eliminating nuisance parameters, reducing dimension, etc. In single or multiple testing problems that involve a large number of parameters, there can be a dramatic increase in power from reducing the number of parameters tested, particularly when the non-null parameters are relatively sparse. While there is some loss of power associated with testing on only a fraction of the available data, carefully selecting a test statistic may in turn improve power; it remains unclear whether the reduction in the number of parameters under consideration can outweigh the loss of power from splitting the data. To combat the inherent loss of power seen with data splitting, methods of combining inference across several splits of the data are developed. The power of these methods is compared with that of full-data tests, as well as tests using only a single split of the data.
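The basic mechanism can be sketched in a few lines (a toy construction, not the speaker's procedure): split the sample, screen down to a small set of candidate parameters on the first half, and spend the multiplicity correction only on those candidates in the second half. The effect sizes, split ratio, and screening size below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, n = 1000, 50
mu = np.zeros(p)
mu[:10] = 0.8                                    # sparse non-null means
X = rng.normal(mu, 1.0, size=(n, p))             # n observations of p parameters

# Split the n observations in half; the two halves are independent.
half = n // 2
z1 = np.sqrt(half) * X[:half].mean(axis=0)       # screening statistics
z2 = np.sqrt(n - half) * X[half:].mean(axis=0)   # testing statistics

keep = np.argsort(-np.abs(z1))[:20]              # screen to 20 candidates
pvals = 2 * norm.sf(np.abs(z2[keep]))
reject = keep[pvals < 0.05 / len(keep)]          # Bonferroni over 20, not 1000
```

The Bonferroni threshold is 0.05/20 rather than 0.05/1000, which is where the power gain from sparsity comes from; the cost is that each test statistic uses only half the sample.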

Tuesday, January 30th: Dr. Julia Fukuyama, Stanford University

214 Duxbury Hall, 2:00pm

Title: Dimensionality Reduction with Structured Variables and Applications to the Microbiome

Abstract: Studies of the microbiome, the complex communities of bacteria that live in and around us, present interesting statistical problems. In particular, bacteria are best understood as the result of a continuous evolutionary process and methods to analyze data from microbiome studies should use the evolutionary history. Motivated by this example, I describe adaptive gPCA, a method for dimensionality reduction that uses the evolutionary structure as a regularizer and to improve interpretability of the low-dimensional space. I also discuss how adaptive gPCA applies to general variable structures, including variables structured according to a network, as well as implications for supervised learning and structure estimation.

Friday, February 2nd: Dr. Zihuai He, Columbia University

214 Duxbury Hall, 10:00am

Title: Inference for Statistical Interactions Under Misspecified or High-Dimensional Main Effects

Abstract: An increasing number of multi-omic studies have generated complex high-dimensional data. A primary focus of these studies is to determine whether exposures interact in the effect that they produce on an outcome of interest. Interaction is commonly assessed by fitting regression models in which the linear predictor includes the product between those exposures. When the main interest lies in interactions, the standard approach is not satisfactory because it is prone to (possibly severe) Type I error inflation when the main exposure effects are misspecified or high-dimensional. I will propose generalized score type tests for high-dimensional interaction effects on correlated outcomes. I will also discuss the theoretical justification of some empirical observations regarding Type I error control, and introduce solutions to achieve robust inference for statistical interactions. The proposed methods will be illustrated using an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with four exams.

Tuesday, February 6th: Dr. Junwei Lu, Princeton University

214 Duxbury Hall, 2:00pm

Title: Combinatorial Inference

Abstract: We propose combinatorial inference to explore the topological structures of graphical models. Combinatorial inference can conduct hypothesis tests on many graph properties, including connectivity, hub detection, perfect matching, etc. On the other hand, we also develop a generic minimax lower bound which shows the optimality of the proposed method for a large family of graph properties. Our methods are applied to neuroscience by discovering hub voxels contributing to visual memories.

Friday, February 9th: Dr. Hyungsuk Tak, SAMSI

214 Duxbury Hall, 10:00am

Title: Astronomical Time Delay Estimation via a Repelling-Attracting Metropolis Algorithm

Abstract: I introduce an astronomical time delay estimation problem and a new Markov chain Monte Carlo method. The gravitational field of a galaxy can act as a lens and deflect the light emitted by a more distant object such as a quasar (quasi-stellar object). Strong gravitational lensing causes multiple images of the same quasar to appear in the sky. Since the light in each gravitationally lensed image traverses a different path length from the quasar to the Earth, fluctuations in the source brightness are observed in the several images at different times. The time delay between these fluctuations can be used to constrain cosmological parameters and can be inferred from the time series of brightness data. To estimate the time delay, we construct a model based on a state-space representation for irregularly observed time series generated by a latent continuous-time Ornstein–Uhlenbeck process.

However, the time delay estimation often suffers from multimodality. To handle this, we propose the repelling-attracting Metropolis (RAM) algorithm that maintains the simple-to-implement nature of the Metropolis algorithm, but is more likely to jump between modes. The RAM algorithm is a Metropolis-Hastings algorithm with a proposal that consists of a downhill move in density that aims to make local modes repelling, followed by an uphill move in density that aims to make local modes attracting. This down-up movement in density increases the probability of a proposed move to a different mode. Because the acceptance probability of the proposal involves a ratio of intractable integrals, we introduce an auxiliary variable which creates a term in the acceptance probability that cancels with the intractable ratio.
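The down-up proposal can be sketched in a few lines. The following is a simplified one-dimensional rendering of the RAM scheme as described above, assuming a Gaussian jumping kernel and a small ε guarding against zero densities; treat the exact acceptance formula as an assumption and consult the paper for the authoritative version.

```python
import numpy as np

def ram_chain(pi, x0, sigma, n_iter, eps=1e-12, seed=0):
    """Repelling-attracting Metropolis: a forced downhill move, then a
    forced uphill move, then an auxiliary downhill draw z whose density
    cancels the intractable terms in the acceptance ratio."""
    rng = np.random.default_rng(seed)

    def forced(x_from, go_down):
        # Repeat Gaussian proposals until one is accepted by the forced rule.
        while True:
            y = x_from + sigma * rng.standard_normal()
            ratio = (pi(x_from) + eps) / (pi(y) + eps)
            if not go_down:
                ratio = 1.0 / ratio          # uphill: flip the ratio
            if rng.random() < min(1.0, ratio):
                return y

    x = x0
    out = np.empty(n_iter)
    for t in range(n_iter):
        xd = forced(x, go_down=True)         # downhill: local modes repel
        xu = forced(xd, go_down=False)       # uphill: local modes attract
        z = forced(xu, go_down=True)         # auxiliary variable
        num = pi(xu) * min(1.0, (pi(x) + eps) / (pi(z) + eps))
        den = pi(x) * min(1.0, (pi(xu) + eps) / (pi(z) + eps))
        if rng.random() < min(1.0, num / (den + eps)):
            x = xu
        out[t] = x
    return out

# Toy bimodal target: equal mixture of N(-3, 1) and N(3, 1), unnormalized.
def bimodal(x):
    return np.exp(-0.5 * (x + 3.0) ** 2) + np.exp(-0.5 * (x - 3.0) ** 2)

samples = ram_chain(bimodal, x0=-3.0, sigma=1.0, n_iter=4000)
```

Started at one mode, the chain should visit both modes, which is exactly the behavior a plain random-walk Metropolis sampler with the same step size struggles to achieve.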

Tuesday, February 13th: Dr. Michael Sohn, University of Pennsylvania

214 Duxbury Hall, 2:00pm

Title: Statistical Methods in Microbiome Data Analysis

Abstract: Microbiome studies involve new computational and statistical challenges due to the characteristics of microbiome data: high sparsity, over-dispersion, and high dimensionality. I am going to present two methods that account for the characteristics of microbiome data: 1) a GLM-based latent variable ordination method and 2) a compositional mediation model.

1) GLM-based latent variable ordination method: Distance-based ordination methods, such as the principal coordinate analysis (PCoA), are incapable of distinguishing between location effect (i.e., the difference in mean) and dispersion effect (i.e., the difference in variation) when there is a strong dispersion effect. In other words, PCoA may falsely display a location effect when there is a strong dispersion effect but no location effect. To resolve this potential problem, I proposed, as an ordination method, a zero-inflated quasi-Poisson factor model whose estimated factor loadings are used to display the similarity of samples.

2) Compositional mediation model: The causal mediation model has been extended to incorporate nonlinearity, treatment-mediation interaction, and multiple mediators. These models, however, are not directly applicable when mediators are components of a composition. I proposed a causal, compositional mediation model utilizing the algebra for compositions in the simplex space and an L1 penalized linear regression for compositional data in high-dimensional settings. The estimators of the direct and indirect (or mediation) effects are defined under the potential outcomes framework to establish causal interpretation. The model involves a novel integration of statistical methods in high dimensional regression analysis, compositional data analysis, and causal inference.

Friday, February 23rd: Dr. Jared Murray, University of Texas

214 Duxbury Hall, 10:00am

Title: Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects

Abstract: We introduce a semi-parametric Bayesian regression model for estimating heterogeneous treatment effects from observational data. Standard nonlinear regression models, which may work quite well for prediction, can yield badly biased estimates of treatment effects when fit to data with strong confounding. Our Bayesian causal forests model avoids this problem by directly incorporating an estimate of the propensity function in the specification of the response model, implicitly inducing a covariate-dependent prior on the regression function. This new parametrization also allows treatment heterogeneity to be regularized separately from the prognostic effect of control variables, making it possible to informatively "shrink to homogeneity", in contrast to existing Bayesian non- and semi-parametric approaches.

Preprint: https://arxiv.org/abs/1706.09523

Friday, March 9th: Professor Rong Chen, Rutgers University

214 Duxbury Hall, 10:00am

Title: Factor Model for High Dimensional Matrix Valued Time Series

Abstract: In finance, economics, and many other fields, observations in matrix or tensor form are often collected over time. For example, many economic indicators are recorded in different countries over time, as are various financial characteristics of many companies. Although it is natural to turn the matrix observations into a long vector and then use standard vector time series models or factor analysis, it is often the case that the columns and rows of a matrix represent different sets of information that closely interplay. We propose a novel factor model that maintains and utilizes the matrix structure to achieve greater dimension reduction as well as a more easily interpretable factor structure. The estimation procedure, its theoretical properties, and model validation procedures are investigated and demonstrated with simulated and real examples. Extension to tensor time series will be discussed.

*Joint work with Dong Wang (Princeton University), Xialu Liu (San Diego State University) and Elynn Chen (Rutgers University)*
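For readers unfamiliar with the setup, a matrix factor model of this kind is commonly written as follows (notation and dimensions are illustrative assumptions, not taken from the talk):

```latex
% X_t is the p x q matrix observed at time t;
% R (p x k) and C (q x r) are row and column loading matrices,
% F_t (k x r) is the latent factor matrix, E_t is a noise matrix.
X_t = R \, F_t \, C^{\top} + E_t, \qquad t = 1, \dots, T.
```

Dimension reduction then acts on rows and columns separately, through k row factors and r column factors, rather than on the full p x q vectorization.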

Friday, March 23rd: Dr. Suprateek Kundu, Emory University

214 Duxbury Hall, 10:00am

Title: Scalable Bayesian Variable Selection for Structured High-dimensional Data

Abstract: Variable selection for structured covariates lying on an underlying known graph is a problem motivated by practical applications and has been a topic of increasing interest. However, most of the existing methods may not be scalable to high-dimensional settings involving tens of thousands of variables lying on known pathways, such as the case in genomics studies. We propose an adaptive Bayesian shrinkage approach which incorporates prior network information by smoothing the shrinkage parameters for connected variables in the graph, so that the corresponding coefficients have a similar degree of shrinkage. We fit our model via a computationally efficient expectation-maximization algorithm which is scalable to high-dimensional settings (p ~ 100,000). Theoretical properties for fixed as well as increasing dimensions are established, even when the number of variables increases faster than the sample size. We demonstrate the advantages of our approach in terms of variable selection, prediction, and computational scalability via a simulation study, and apply the method to a cancer genomics study.

Friday, March 30th: Dr. Dingcheng Li, Baidu

214 Duxbury Hall, 10:00am

Title: Empowering Natural Language Processing with Data Semantics Discoveries

Abstract: Data semantics has been recognized as quite helpful for natural language processing. For example, adding semantically related tags refines the efficiency and precision of literature search, and construction of word embeddings improves the accuracy and reliability of document classification. However, the two tasks, which involve heavy semantics, are often challenging when the data size and/or the label size are extremely large and/or when label hierarchies have to be considered. To tackle the first challenge, we propose a novel approach, Deep Level-wise Extreme Multi-label Learning and Classification (Deep Level-wise XMLC), to facilitate the semantic indexing of literature. For the second challenge, we propose to construct a compound neural network for representation learning. This compound neural network is composed of a topic word embedding module, a topic sparse autoencoder, and a CNN-LSTM based classifier. Empirical experiments show that Deep Level-wise XMLC can handle tagging tasks with thousands of hierarchically organized labels, and that the compound neural network outperforms the state of the art on different datasets.

Tuesday, April 2nd: Dr. Jyotishka Datta, University of Arkansas at Fayetteville

214 Duxbury Hall, 2:00pm

Title: Bayesian Sparse Signal Recovery: Horseshoe Regularization

Abstract: Global-local shrinkage priors have been established as the current state-of-the-art inferential tool for sparse signal detection and recovery, as well as the default choice for handling non-linearity in what have hitherto been paradoxical problems without a Bayesian answer. Despite these success stories, certain aspects of their behavior, such as validity as a non-convex regularization method and performance in the presence of correlated errors or an unknown error distribution, remain unexplored. In this talk, I will offer solutions to some of these open problems motivated by the changing landscape of modern applications. In the first half, I will discuss the notions of theoretical optimality for sparse signal recovery using global-local shrinkage priors. In the second half, I will build a non-convex prior-penalty dual that offers the best of both the Bayesian and frequentist worlds, by merging full uncertainty characterization with fast and direct mode exploration. The distinguishing feature from existing non-convex optimization approaches is a full probabilistic representation of the penalty as the negative of the logarithm of a suitable prior, which in turn enables efficient expectation-maximization and local linear approximation algorithms for optimization and MCMC for uncertainty quantification. We will also demonstrate the performance of these methods on synthetic data sets and in cancer genomics studies, where they provide better statistical performance and the computation requires a fraction of the time of state-of-the-art non-convex solvers.
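A minimal simulation sketch of the global-local idea (an illustration of the horseshoe hierarchy, not the speaker's method): each coefficient is conditionally Gaussian with a half-Cauchy local scale, which produces both a heavy spike of draws near zero and far heavier tails than a Gaussian of comparable scale.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
tau = 1.0                                   # global scale
lam = np.abs(rng.standard_cauchy(n))        # local scales: half-Cauchy(0, 1)
beta = rng.normal(0.0, tau * lam)           # beta_j | lam_j ~ N(0, tau^2 lam_j^2)
normal_draws = rng.standard_normal(n)       # Gaussian baseline for comparison

spike = np.mean(np.abs(beta) < 0.1)         # mass piled up near zero
tails = np.mean(np.abs(beta) > 10.0)        # heavy-tail mass
```

Compared with the Gaussian baseline, the horseshoe draws put noticeably more mass both near zero (shrinking noise) and far in the tails (leaving large signals unshrunk), which is the "best of both worlds" behavior that makes the negative log prior a natural non-convex penalty.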

Thursday, April 5th: Dr. Chong Wu, University of Minnesota

214 Duxbury Hall, 2:00pm

Title: Adaptive Significance Testing for High-Dimensional Generalized Linear Models with Application to Alzheimer’s Disease

Abstract: Significance testing for high-dimensional generalized linear models (GLMs) has been increasingly needed in various applications. Motivated by the analysis of an Alzheimer’s disease dataset, I will first present an adaptive test on a high-dimensional parameter of a GLM in the presence of a low-dimensional nuisance parameter, which can maintain correct Type 1 error rate and high power across a wide range of scenarios.

In the second part, I will consider the case with a high-dimensional nuisance parameter. I will present a new method that combines non-convex penalized regression and adaptive testing, aiming to control Type 1 error rate and maintain high power. To calculate its p-value analytically, the asymptotic distribution of the test statistic is derived.

I will illustrate the applications of the newly proposed methods to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, detecting possible associations between Alzheimer’s disease and genetic variants in some pathways, and identifying possible gene/pathway gender interactions for Alzheimer’s disease.

Friday, April 6th: Dr. Malay Ghosh, University of Florida

214 Duxbury Hall, 10:00am

Title: Bayesian Multiple Testing under Sparsity

Abstract: This talk reviews certain Bayesian procedures that have recently been proposed to address multiple testing under sparsity. Consider the problem of simultaneous testing for the means of independent normal observations. In this talk we study asymptotic optimality properties of certain multiple testing rules in a Bayesian decision theoretic framework, where the overall loss of a multiple testing rule is taken as the number of misclassified hypotheses. The multiple testing rules considered include spike-and-slab priors as well as a general class of one-group shrinkage priors for the mean parameters. The latter is rich enough to include, among others, the families of three-parameter beta and generalized double Pareto priors, and in particular the horseshoe, normal-exponential-gamma, and Strawderman-Berger priors. Within the chosen asymptotic framework, the multiple testing rules under study asymptotically attain the risk of the Bayes Oracle. Some classical multiple testing procedures are also evaluated within the proposed Bayesian framework.

Friday, April 20th: Dr. Xing Qiu, University of Rochester Medical Center

214 Duxbury Hall, 10:00am

Title: The Dawn of Large p^{2}-Level Analyses

Abstract: In the past two decades or so, the emergence of many different types of high-throughput data, such as whole transcriptome gene expression, genotyping, and microbiota abundance data, has revolutionized medical research. One common property shared by these “Omics” data is that they typically have many more features than the number of independent measurements (sample size). This property is also known as the “large *p*, small *n*” property in the research community, and has motivated many instrumental statistical innovations.

Due to the rapid advance of biotechnology, the unit cost of generating high-throughput data has decreased significantly in recent years and the sample size of those data has increased considerably. With the increased sample size and better signal-to-noise ratio, medical investigators are starting to ask more sophisticated questions: feature selection based on hypothesis testing and regression analysis is no longer the end, but the new starting point for secondary analyses such as network analysis, multi-modal data association, gene set analyses, etc. The overarching theme of these advanced analyses is that they all require statistical inference for models that involve p^{2} parameters. It takes a combination of proper data preprocessing, feature selection, dimension reduction, model building and selection, as well as domain knowledge and computational skills to perform these avant-garde analyses. Despite these technical challenges, I believe that they will soon become mainstream, and inspire a generation of young statisticians and data scientists to invent the next big breakthroughs in statistical science. In this talk, I will share some of my recent methodology and collaborative research that involves “large p^{2}” models, and list a few potential extensions of these methods that may be used in other areas of statistics.