## Spring 2018 Colloquia

**Upcoming Colloquia:**

- March 23rd - Suprateek Kundu (Emory University)
- March 30th - Dingcheng Li (Baidu)
- April 6th - Malay Ghosh (University of Florida)
- April 20th - Xing Qiu (University of Rochester Medical Center)

Tuesday, January 9th: Dr. Hongyuan Cao, University of Missouri

204 Duxbury Hall, 2:00pm

Title: Multiple testing meets isotonic regression for high dimensional integrative analysis

Abstract: Large-scale multiple testing is a fundamental problem in statistical inference. Recent technological advances make available various types of auxiliary information, such as prior data and external covariates. Our goal is to utilize such auxiliary information to improve statistical power in multiple testing. This is formally achieved through a shape-constrained relationship between the auxiliary information and the prior probabilities of being null. We propose to estimate the unknown parameters by embedding isotonic regression in the EM algorithm. The resulting testing procedure utilizes the ordering of the auxiliary information and is thus very robust. We show, both empirically and theoretically, that the proposed method leads to a large power increase while controlling the false discovery rate. Extensive simulations demonstrate the advantage of the proposed method over several state-of-the-art methods. The new methodology is illustrated with a data set from genome-wide association studies.
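The abstract's idea of embedding isotonic regression in the EM algorithm can be sketched in a toy two-group model. This is not the speaker's implementation: the Beta(a, 1) alternative density for p-values, the fixed value of a, and the direction of monotonicity (larger covariate means more likely null) are all illustrative assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Toy setup: larger auxiliary covariate x -> more likely to be null (assumed).
n = 2000
x = np.sort(rng.uniform(size=n))
true_pi0 = 0.5 + 0.5 * x                    # monotone prior null probability
is_null = rng.uniform(size=n) < true_pi0
a = 0.3                                     # Beta(a, 1) alternative, fixed here
p = np.where(is_null, rng.uniform(size=n), rng.beta(a, 1.0, size=n))

pi0 = np.full(n, 0.9)                       # initial guess for prior null probs
iso = IsotonicRegression(y_min=1e-3, y_max=1 - 1e-3, increasing=True)
for _ in range(50):
    f1 = a * p ** (a - 1.0)                 # alternative density of the p-values
    # E-step: posterior probability that each hypothesis is null
    w = pi0 / (pi0 + (1.0 - pi0) * f1)
    # M-step: isotonic regression (PAVA) of the responsibilities on x,
    # which enforces the shape constraint on the prior null probabilities
    pi0 = iso.fit_transform(x, w)

# Local false discovery rate under the fitted model
lfdr = pi0 / (pi0 + (1.0 - pi0) * (a * p ** (a - 1.0)))
```

The isotonic M-step depends on the auxiliary covariate only through its ordering, which is the source of the robustness the abstract mentions.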

Friday, January 12th: Dr. Anderson Ye Zhang, Yale University

214 Duxbury Hall, 10:00am

Title: Computational and Statistical Guarantees of Mean Field Variational Inference

Abstract: Mean field variational inference is widely used in statistics and machine learning to approximate posterior distributions. Despite its popularity, there exists remarkably little fundamental theoretical justification. To the best of our knowledge, the iterative algorithm has never been investigated for any high-dimensional or complex model. In this talk we attempt to establish computational and statistical guarantees for mean field variational inference. For the community detection problem, we show that the iterative algorithm converges linearly to the optimal statistical accuracy within log n iterations. In addition, the technique we develop can be extended to analyzing the expectation–maximization (EM) algorithm and the Gibbs sampler.
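The "iterative algorithm" the abstract refers to is coordinate ascent on the mean field objective. A minimal sketch of such iterations, on a two-component Gaussian mixture rather than the community detection model studied in the talk, looks as follows; the prior variance `s2`, the symmetric initialization, and the iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
# Data from a two-component Gaussian mixture with unit variance.
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
n, K, s2 = len(x), 2, 10.0       # s2: prior variance on the component means

m = np.array([-1.0, 1.0])        # variational means of q(mu_k), symmetric init
v = np.ones(K)                   # variational variances of q(mu_k)
for _ in range(100):
    # Update q(c_i): soft assignment of each point to a component,
    # using E_q[log p(x_i | mu_k)] up to an additive constant.
    logits = np.outer(x, m) - 0.5 * (m**2 + v)
    phi = np.exp(logits - logits.max(1, keepdims=True))
    phi /= phi.sum(1, keepdims=True)
    # Update q(mu_k): Gaussian with precision 1/s2 + sum_i phi_ik.
    nk = phi.sum(0)
    v = 1.0 / (1.0 / s2 + nk)
    m = v * (phi * x[:, None]).sum(0)
```

Each sweep updates one factor of the variational distribution while holding the others fixed; the talk's contribution is to quantify how fast such sweeps converge and how accurate the fixed point is.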

Friday, January 19th: Dr. Steffen Ventz, University of Rhode Island

214 Duxbury Hall, 10:00am

Title: Bayesian Uncertainty-Directed Designs for Clinical Trials

Abstract: Most Bayesian response-adaptive designs unbalance randomization rates towards the most promising arms with the goal of increasing the number of positive treatment outcomes during the study, even though the primary aim of the trial is different. In this talk I discuss Bayesian uncertainty-directed (BUD) designs, a class of Bayesian designs in which the investigator specifies an information measure tailored to the clinical experiment. All decisions during the trial are selected to optimize the available information at the end of the study. The approach can be applied to several designs, ranging from early-stage dose-finding trial designs to biomarker-driven and multi-endpoint trials. I discuss the asymptotic limit of the proportion of patients allocated to each treatment. The finite-sample operating characteristics of BUD designs are illustrated through several examples, including a dose-finding design with biomarker measurements that allow dose optimization at the individual level, biomarker-stratified trials, and trials with multiple co-primary endpoints.

Tuesday, January 23rd: Dr. David Jones, Duke University

214 Duxbury Hall, 2:00pm

Title: Designing Test Information and Test Information in Design

Abstract: Lindley (1956) and DeGroot (1962) developed a general framework for constructing Bayesian measures of the expected information that an experiment will provide for estimation. We propose an analogous framework for Bayesian measures of information for decision problems, such as hypothesis testing, classification, or model selection. We demonstrate how these "test information" measures can be used in experimental design, and show that the resulting designs differ from designs based on estimation information, e.g., Fisher information or the information proposed by Lindley (1956) (which results in D-optimal designs). The underlying intuition of our design proposals is straightforward: to distinguish between two or more models we should collect data from regions of the covariate space for which the models differ most. We identify a fundamental coherence identity which, when satisfied, ensures that the optimal design does not depend on which hypothesis is true. We additionally establish an asymptotic equivalence between our test information measures and Fisher information, thereby offering insight for contexts where both testing and estimation are of interest. Lastly, we illustrate our design ideas by applying them to a classification problem in astronomy. The task is to schedule future observation times in order to best classify stars based on time series of their brightness.

Friday, January 26th: Dr. Cyrus DiCiccio, Stanford University

214 Duxbury Hall, 10:00am

Title: Improving Efficiency of Hypothesis Tests via Data Splitting

Abstract: Data splitting, a tool that is well studied and commonly used for estimation problems such as assessing prediction error, can also be useful in testing problems: a portion of the data can be allocated to make the testing problem easier in some sense, say by estimating or even eliminating nuisance parameters, reducing dimension, and so on. In single or multiple testing problems that involve a large number of parameters, there can be a dramatic increase in power from reducing the number of parameters tested, particularly when the non-null parameters are relatively sparse. While there is some loss of power associated with testing on only a fraction of the available data, carefully selecting a test statistic may in turn improve power, though it is not obvious whether reducing the number of parameters under consideration can outweigh the power lost by splitting the data. To combat the inherent loss of power seen with data splitting, methods of combining inference across several splits of the data are developed. The power of these methods is compared with the power of full-data tests, as well as tests using only a single split of the data.
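The basic screen-then-test mechanism behind this trade-off can be sketched in a sparse normal-means toy problem (not the speaker's setup; the signal strength, screening threshold, and Bonferroni correction are illustrative choices).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, n = 500, 200                        # hypotheses, observations per hypothesis
mu = np.zeros(m)
mu[:10] = 0.5                          # sparse non-null means (assumed setup)
X = rng.normal(mu, 1.0, size=(n, m))

# Split the sample: first half for screening, second half for testing.
X1, X2 = X[: n // 2], X[n // 2 :]
z1 = X1.mean(axis=0) * np.sqrt(n // 2)
keep = np.flatnonzero(np.abs(z1) > 1.0)    # screen out clearly null coordinates

# Test only the survivors on the held-out half; Bonferroni now divides
# alpha by the number of survivors instead of by all m hypotheses.
z2 = X2.mean(axis=0)[keep] * np.sqrt(n - n // 2)
pvals = 2 * norm.sf(np.abs(z2))
rejected = keep[pvals < 0.05 / max(len(keep), 1)]
```

Because screening and testing use disjoint halves of the data, the second-stage p-values remain valid, which is exactly why the multiplicity burden can be reduced without cheating.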

Tuesday, January 30th: Dr. Julia Fukuyama, Stanford University

214 Duxbury Hall, 2:00pm

Title: Dimensionality Reduction with Structured Variables and Applications to the Microbiome

Abstract: Studies of the microbiome, the complex communities of bacteria that live in and around us, present interesting statistical problems. In particular, bacteria are best understood as the result of a continuous evolutionary process and methods to analyze data from microbiome studies should use the evolutionary history. Motivated by this example, I describe adaptive gPCA, a method for dimensionality reduction that uses the evolutionary structure as a regularizer and to improve interpretability of the low-dimensional space. I also discuss how adaptive gPCA applies to general variable structures, including variables structured according to a network, as well as implications for supervised learning and structure estimation.

Friday, February 2nd: Dr. Zihuai He, Columbia University

214 Duxbury Hall, 10:00am

Title: Inference for Statistical Interactions Under Misspecified or High-Dimensional Main Effects

Abstract: An increasing number of multi-omic studies have generated complex high-dimensional data. A primary focus of these studies is to determine whether exposures interact in the effect that they produce on an outcome of interest. Interaction is commonly assessed by fitting regression models in which the linear predictor includes the product between those exposures. When the main interest lies in interactions, the standard approach is not satisfactory because it is prone to (possibly severe) type I error inflation when the main exposure effects are misspecified or high-dimensional. I will propose generalized score-type tests for high-dimensional interaction effects on correlated outcomes. I will also discuss the theoretical justification of some empirical observations regarding type I error control, and introduce solutions to achieve robust inference for statistical interactions. The proposed methods will be illustrated using an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with four exams.

Tuesday, February 6th: Dr. Junwei Lu, Princeton University

214 Duxbury Hall, 2:00pm

Title: Combinatorial Inference

Abstract: We propose combinatorial inference to explore the topological structures of graphical models. Combinatorial inference can conduct hypothesis tests on many graph properties, including connectivity, hub detection, perfect matching, etc. On the theoretical side, we also develop a generic minimax lower bound which shows the optimality of the proposed method for a large family of graph properties. Our methods are applied to neuroscience data, discovering hub voxels that contribute to visual memories.

Friday, February 9th: Dr. Hyungsuk Tak, SAMSI

214 Duxbury Hall, 10:00am

Title: Astronomical Time Delay Estimation via a Repelling-Attracting Metropolis Algorithm

Abstract: I introduce an astronomical time delay estimation problem and a new Markov chain Monte Carlo method. The gravitational field of a galaxy can act as a lens and deflect the light emitted by a more distant object such as a quasar (quasi-stellar object). Strong gravitational lensing causes multiple images of the same quasar to appear in the sky. Since the light in each gravitationally lensed image traverses a different path length from the quasar to the Earth, fluctuations in the source brightness are observed in the several images at different times. The time delay between these fluctuations can be used to constrain cosmological parameters and can be inferred from the time series of brightness data. To estimate the time delay, we construct a model based on a state-space representation for irregularly observed time series generated by a latent continuous-time Ornstein–Uhlenbeck process.

However, the time delay estimation often suffers from multimodality. To handle this, we propose the repelling-attracting Metropolis (RAM) algorithm that maintains the simple-to-implement nature of the Metropolis algorithm, but is more likely to jump between modes. The RAM algorithm is a Metropolis-Hastings algorithm with a proposal that consists of a downhill move in density that aims to make local modes repelling, followed by an uphill move in density that aims to make local modes attracting. This down-up movement in density increases the probability of a proposed move to a different mode. Because the acceptance probability of the proposal involves a ratio of intractable integrals, we introduce an auxiliary variable which creates a term in the acceptance probability that cancels with the intractable ratio.
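The down-up proposal described above can be sketched on a simple bimodal target. This is a compact 1-D illustration of the RAM mechanism as described in the abstract, not the speaker's implementation; the regularizer `eps`, the Gaussian jump scale, and the target density are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 1e-3      # regularizer keeping downhill/uphill ratios well defined
sigma = 1.0     # Gaussian jump scale (illustrative)

def pi(x):
    # Bimodal target: equal mixture of N(-3, 1) and N(3, 1), unnormalized.
    return 0.5 * np.exp(-0.5 * (x + 3) ** 2) + 0.5 * np.exp(-0.5 * (x - 3) ** 2)

def forced(x_from, direction):
    """Propose from N(x_from, sigma^2) until the forced move is accepted.
    direction = -1: downhill (repelling); +1: uphill (attracting)."""
    while True:
        y = x_from + sigma * rng.normal()
        ratio = (pi(x_from) + eps) / (pi(y) + eps)
        if direction > 0:
            ratio = 1.0 / ratio
        if rng.uniform() < min(1.0, ratio):
            return y

def ram_step(x0):
    x1 = forced(x0, -1)      # downhill move: escape the current mode
    x2 = forced(x1, +1)      # uphill move: climb toward another mode
    z = forced(x2, -1)       # auxiliary variable cancelling intractable terms
    num = pi(x2) * min(1.0, (pi(x0) + eps) / (pi(z) + eps))
    den = pi(x0) * min(1.0, (pi(x2) + eps) / (pi(z) + eps))
    return x2 if rng.uniform() < min(1.0, num / den) else x0

x, chain = 3.0, []
for _ in range(5000):
    x = ram_step(x)
    chain.append(x)
chain = np.array(chain)
```

Starting in the mode at +3, a plain random-walk Metropolis chain with the same jump scale would rarely cross the low-density valley at 0, whereas the down-up proposal crosses it by construction.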

Tuesday, February 13th: Dr. Michael Sohn, University of Pennsylvania

214 Duxbury Hall, 2:00pm

Title: Statistical Methods in Microbiome Data Analysis

Abstract: Microbiome studies involve new computational and statistical challenges due to the characteristics of microbiome data: high sparsity, over-dispersion, and high dimensionality. I am going to present two methods that account for these characteristics: 1) a GLM-based latent variable ordination method and 2) a compositional mediation model.

1) GLM-based latent variable ordination method: Distance-based ordination methods, such as principal coordinate analysis (PCoA), are incapable of distinguishing between a location effect (i.e., a difference in mean) and a dispersion effect (i.e., a difference in variation) when there is a strong dispersion effect. In other words, PCoA may falsely display a location effect when there is a strong dispersion effect but no location effect. To resolve this potential problem, I proposed, as an ordination method, a zero-inflated quasi-Poisson factor model whose estimated factor loadings are used to display the similarity of samples.

2) Compositional mediation model: The causal mediation model has been extended to incorporate nonlinearity, treatment-mediation interaction, and multiple mediators. These models, however, are not directly applicable when mediators are components of a composition. I proposed a causal, compositional mediation model utilizing the algebra for compositions in the simplex space and an L1 penalized linear regression for compositional data in high-dimensional settings. The estimators of the direct and indirect (or mediation) effects are defined under the potential outcomes framework to establish causal interpretation. The model involves a novel integration of statistical methods in high dimensional regression analysis, compositional data analysis, and causal inference.

Friday, February 23rd: Dr. Jared Murray, University of Texas

214 Duxbury Hall, 10:00am

Title: Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects

Abstract: We introduce a semi-parametric Bayesian regression model for estimating heterogeneous treatment effects from observational data. Standard nonlinear regression models, which may work quite well for prediction, can yield badly biased estimates of treatment effects when fit to data with strong confounding. Our Bayesian causal forests model avoids this problem by directly incorporating an estimate of the propensity function in the specification of the response model, implicitly inducing a covariate-dependent prior on the regression function. This new parametrization also allows treatment heterogeneity to be regularized separately from the prognostic effect of control variables, making it possible to informatively "shrink to homogeneity", in contrast to existing Bayesian non- and semi-parametric approaches.

Preprint: https://arxiv.org/abs/1706.09523

Friday, March 9th: Professor Rong Chen, Rutgers University

214 Duxbury Hall, 10:00am

Title: Factor Model for High Dimensional Matrix Valued Time Series

Abstract: In finance, economics, and many other fields, observations in matrix and tensor form are often collected over time. For example, many economic indicators are obtained in different countries over time, as are various financial characteristics of many companies. Although it is natural to turn the matrix observations into a long vector and then use standard vector time series models or factor analysis, it is often the case that the columns and rows of a matrix represent different sets of information that are closely interrelated. We propose a novel factor model that maintains and utilizes the matrix structure to achieve greater dimension reduction as well as a more easily interpretable factor structure. Estimation procedures, their theoretical properties, and model validation procedures are investigated and demonstrated with simulated and real examples. Extensions to tensor time series will be discussed.

*Joint work with Dong Wang (Princeton University), Xialu Liu (San Diego State University) and Elynn Chen (Rutgers University)*