## Fall 2016 Colloquia

**Friday, September 9: Dr. Adrian Barbu, Florida State University Department of Statistics**

214 Duxbury Hall, 10:00am

Title: A Novel Method for Obtaining Tree Ensembles by Loss Minimization

Abstract: Tree ensembles can capture the relevant variables and, to some extent, the relationships between them in a compact and interpretable manner. Most algorithms for obtaining tree ensembles are based on versions of Boosting or Random Forest. Previous work showed that Boosting algorithms exhibit a cyclic behavior of selecting the same tree again and again due to the way the loss is optimized. At the same time, Random Forest is not based on loss optimization and obtains a less compact and less interpretable model. In this talk we present a novel method for obtaining a compact ensemble of trees that grows a pool of trees in parallel with many independent Boosting threads and then selects a small subset and updates their leaf weights by loss optimization. Experiments on real datasets show that the obtained model usually has a smaller loss than Boosting, which is also reflected in a lower misclassification error on the test set.
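
The grow-a-pool, select-a-subset, refit-the-leaf-weights idea can be illustrated with a small least-squares toy (the random "stumps", greedy selection rule, and data below are hypothetical illustrations, not the authors' algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the response depends on the first feature only
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)

def stump_predict(X, feat, thresh):
    # A depth-1 "tree": indicator of one feature crossing a threshold
    return (X[:, feat] > thresh).astype(float)

# 1) Grow a pool of candidate trees (here: random stumps) independently
pool = [(int(rng.integers(5)), float(rng.normal())) for _ in range(50)]
preds = np.column_stack([np.ones(len(y))] +
                        [stump_predict(X, f, t) for f, t in pool])

# 2) Greedily select a small subset, refitting all leaf weights by
#    least-squares loss minimization at every step
selected = [0]  # always keep the intercept column
for _ in range(3):
    best_j, best_loss = None, np.inf
    for j in range(1, preds.shape[1]):
        if j in selected:
            continue
        cols = preds[:, selected + [j]]
        w = np.linalg.lstsq(cols, y, rcond=None)[0]
        loss = np.mean((cols @ w - y) ** 2)
        if loss < best_loss:
            best_j, best_loss = j, loss
    selected.append(best_j)

# Final compact ensemble: a few trees with jointly refit leaf weights
cols = preds[:, selected]
w = np.linalg.lstsq(cols, y, rcond=None)[0]
final_loss = np.mean((cols @ w - y) ** 2)
```

The refit step is what distinguishes this from simply keeping the trees' original Boosting weights: the leaf weights of the selected subset are re-optimized jointly against the loss.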

**Friday, September 16: Dr. Antonio Linero, Florida State University Department of Statistics**

214 Duxbury Hall, 10:00am

Title: Bayesian regression trees for high dimensional prediction and variable selection

Abstract: Decision tree ensembles are an extremely popular tool for obtaining high quality predictions in nonparametric regression problems. Unmodified, however, many commonly used decision tree ensemble methods do not adapt to sparsity in the regime in which the number of predictors is larger than the number of observations. A recent stream of research concerns the construction of decision tree ensembles which are motivated by a generative probabilistic model, the most influential method being the Bayesian additive regression trees framework. In this talk, we take a Bayesian point of view on this problem and show how to construct priors on decision tree ensembles which are capable of adapting to sparsity by placing a sparsity-inducing Dirichlet hyperprior on the splitting proportions of the regression tree. We demonstrate the efficacy of this approach in simulation studies, and argue for its theoretical strengths by showing that, under certain conditions, the posterior concentrates around the true regression function at a rate which is independent of the number of predictors. Our approach has additional benefits over Bayesian methods for constructing tree ensembles, such as allowing for fully Bayesian variable selection.
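
The sparsity-inducing effect of a Dirichlet hyperprior on the splitting proportions can be seen in a small simulation (the concentration values below are hypothetical illustrations, not those used in the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 100  # number of predictors

# Splitting proportions s ~ Dirichlet(a/p, ..., a/p).  A small total
# concentration a pushes nearly all probability mass onto a few
# coordinates, so trees split almost exclusively on a handful of
# predictors -- the sparsity-adaptation mechanism the abstract describes.
sparse = rng.dirichlet(np.full(p, 0.5 / p))

# By contrast, a large concentration gives near-uniform proportions,
# spreading splits over all predictors (no adaptation to sparsity)
dense = rng.dirichlet(np.full(p, 10.0))

# Probability mass carried by the 5 most-used predictors under each prior
top5_sparse = float(np.sort(sparse)[-5:].sum())
top5_dense = float(np.sort(dense)[-5:].sum())
```

Under the sparse prior, a few coordinates carry almost all the splitting probability; under the dense prior each of the 100 predictors gets roughly 1% of it.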

**Friday, September 23: Dr. Xiao Wang, Purdue University**

214 Duxbury Hall, 10:00am

Title: Quantile Image-on-Scalar Regression

Abstract: Quantile regression with functional response and scalar covariates has become an important statistical tool for many neuroimaging studies. In this paper, we study optimal estimation of varying coefficient functions in the framework of reproducing kernel Hilbert spaces. Minimax rates of convergence under both fixed and random designs are established. We have developed easily implementable estimators which are shown to be rate-optimal. Simulations and real data analysis are conducted to examine the finite-sample performance. This is joint work with Zhengwu Zhang, Linglong Kong, and Hongtu Zhu.
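
The check (pinball) loss underlying quantile regression can be sketched as follows (a generic illustration of the loss, not the rate-optimal estimators developed in the paper):

```python
import numpy as np

def check_loss(u, tau):
    # Quantile ("check"/pinball) loss: rho_tau(u) = u * (tau - 1{u < 0}).
    # It penalizes under- and over-prediction asymmetrically.
    return u * (tau - (u < 0))

rng = np.random.default_rng(2)
y = rng.normal(size=2001)
tau = 0.9

# Minimizing the mean check loss over a constant recovers the empirical
# tau-quantile -- the basic fact quantile regression builds on
grid = np.linspace(y.min(), y.max(), 4001)
losses = np.array([check_loss(y - c, tau).mean() for c in grid])
c_star = float(grid[np.argmin(losses)])
```

Replacing the constant with a function of covariates (here, a varying coefficient function of scalar covariates with a functional response) turns this into the quantile regression problem of the talk.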

**Friday, September 30: Dr. Chiwoo Park, Florida State University Industrial and Manufacturing Engineering**

214 Duxbury Hall, 10:00am

Title: Patching Gaussian Processes for Large-Scale Spatial Regression

Abstract: This talk presents a method for solving a Gaussian process (GP) regression with constraints on a regression domain boundary. The method can guide and improve the prediction around a domain boundary with the boundary constraints. More importantly, the method can be applied to improve a local GP regression as a solver of a large-scale regression analysis for remote sensing and other large datasets. In the conventional local GP regression, a regression domain is first partitioned into multiple local regions, and an independent GP model is fit for each local region using the training data belonging to that region. Two key issues with the local GP are that (1) the prediction around the boundary of a local region is not as accurate as the prediction in the interior of the local region, and (2) two local GP models for two neighboring local regions produce different predictions at the boundary of the two regions, creating discontinuity in the regression output. These issues can be addressed by constraining local GP models on the boundary using our constrained GP regression approach. The performance of the proposed approach depends on the “quality” of the constraints posed on the local GP models. We present a method to estimate “good” constraints based on data. Some convergence results and numerical results of the proposed approach will be presented.
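
The boundary-discontinuity issue of conventional local GP regression can be demonstrated in a toy one-dimensional example (the kernel, length-scale, and partition below are hypothetical choices; the constrained repair itself is not implemented here):

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # Squared-exponential kernel (hypothetical length-scale choice)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_predict(x_tr, y_tr, x_te, noise=1e-4):
    # Posterior mean of a zero-mean GP at test locations x_te
    K = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))
    return rbf(x_te, x_tr) @ np.linalg.solve(K, y_tr)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 2.0, 80))
y = np.sin(3.0 * x) + 0.05 * rng.normal(size=80)

# Partition the domain at x = 1 and fit an independent local GP per region
left, right = x < 1.0, x >= 1.0
boundary = np.array([1.0])
pred_left = gp_predict(x[left], y[left], boundary)[0]
pred_right = gp_predict(x[right], y[right], boundary)[0]

# The two local models disagree at the shared boundary: a discontinuity
gap = abs(pred_left - pred_right)
```

Constraining both local models to agree at `x = 1.0` (the talk's constrained GP approach) would drive this gap to zero by construction.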

**Friday, October 7: Dr. Dan Shen, University of South Florida**

214 Duxbury Hall, 10:00am

Title: Dimension Reduction of Neuroimaging Data Analysis

Abstract: High dimensionality has become a common feature of “big data” encountered in many diverse fields, such as neuroimaging and genetic analysis, which provides modern challenges for statistical analysis. To cope with the high dimensionality, dimension reduction becomes necessary. Principal component analysis (PCA) is arguably the most popular classical dimension reduction technique, which uses a few principal components (PCs) to explain most of the data variation.
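
A minimal sketch of classical PCA via the singular value decomposition (a generic illustration, not the weighted variants discussed in the talk):

```python
import numpy as np

rng = np.random.default_rng(4)

# 300 observations in 50 dimensions whose variation lies mostly in a
# 2-dimensional subspace, plus a little isotropic noise
latent = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 50)) * 3.0
X = latent + 0.1 * rng.normal(size=(300, 50))

Xc = X - X.mean(axis=0)              # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)  # variance explained by each PC
scores = Xc @ Vt[:2].T               # project onto the first two PCs
```

Here two PCs explain almost all of the 50-dimensional variation, which is exactly the situation in which a few PCs suffice as a low-dimensional summary.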

I first introduce Multiscale Weighted PCR (MWPCR), a new variation of PCA, for neuroimaging analysis. MWPCR introduces two sets of novel weights, including global and local spatial weights, to enable a selective treatment of individual features and incorporation of class label information as well as spatial patterns within neuroimaging data. Simulation studies and real data analysis show that MWPCR outperforms several competing PCA methods.

Second, we develop statistical methods for analyzing tree-structured data objects. This work is motivated by the statistical challenges of analyzing a set of blood artery trees obtained from a study of Magnetic Resonance Angiography (MRA) brain images of 98 human subjects. The non-Euclidean property of tree space makes the application of conventional statistical analysis, including PCA, to tree data very challenging. We develop an entirely new approach that uses the Dyck path representation, which builds a bridge between the tree space (a non-Euclidean space) and curve space (standard Euclidean space). That bridge enables the exploitation of the power of functional data analysis to explore statistical properties of tree data sets.

**Friday, October 14: Dr. Jonathan Bradley, Florida State University Department of Statistics**

214 Duxbury Hall, 10:00am

Title: Hierarchical Models for Spatial Data with Errors that are Correlated with the Latent Process

Abstract: Prediction of a spatial Gaussian process using a “big dataset” has become a topical area of research over the last decade. The available solutions often involve placing strong assumptions on the error process associated with the data. Specifically, it has typically been assumed that the data are equal to the spatial process of principal interest plus a mutually independent error process. Further, to obtain computationally efficient predictions, additional assumptions on the latent random processes and/or parameter models have become a practical necessity (e.g., low rank models, sparse precision matrices, etc.). In this work, we consider an alternative latent process modeling schematic where it is assumed that the error process is spatially correlated and correlated with the spatial random process of principal interest. We show the counterintuitive result that error process dependencies allow one to remove assumptions on the spatial process of principal interest, and obtain computationally efficient predictions. At the core of this proposed methodology is the definition of a corrupted version of the latent process of interest, which we call the data specific latent process (DSLP). Demonstrations of the DSLP paradigm are provided through simulated examples and through an application using a large dataset consisting of the US Census Bureau’s American Community Survey 5-year period estimates of median household income on census tracts.

**Friday, October 21: Dr. Bing Li, Pennsylvania State University**

214 Duxbury Hall, 10:00am

Title: A nonparametric graphical model for functional data with application to brain networks based on fMRI

Abstract: We introduce a nonparametric graphical model whose observations on vertices are functions. Many modern applications, such as electroencephalogram and functional magnetic resonance imaging (fMRI), produce data of this type. The model is based on Additive Conditional Independence (ACI), a statistical relation that captures the spirit of conditional independence without resorting to multi-dimensional kernels. The random functions are assumed to reside in a Hilbert space. No distributional assumption is imposed on the random functions: instead, their statistical relations are characterized nonparametrically by a second Hilbert space, which is a reproducing kernel Hilbert space whose kernel is determined by the inner product of the first Hilbert space. A precision operator is then constructed based on the second space, which characterizes ACI, and hence also the graph.

The resulting estimator is relatively easy to compute, requiring no iterative optimization or inversion of large matrices. We establish the consistency and the convergence rate of the estimator. Through simulation studies we demonstrate that the estimator performs better than the functional Gaussian graphical model when the relations among vertices are nonlinear or heteroscedastic. The method is applied to an fMRI data set to construct brain networks for patients with attention-deficit/hyperactivity disorder.

**Friday, October 28: Dr. Glen Laird, Sanofi**

214 Duxbury Hall, 10:00am

Title: Statistical Considerations for Pharmaceutical Industry Clinical Trials

Abstract: Clinical trials are the key evidence drivers for the pharmaceutical industry. These trials use a set of statistical methods particular to the setting and regulatory environment. In the context of oncology clinical trials an overview of selected methodological topics will be presented including multiplicity, dose escalation methods, Simon designs, and interim analyses.

**Friday, November 4: Dr. Andre Rogatko, Cedars-Sinai Medical Center**

214 Duxbury Hall, 10:00am

Title: Dose Finding with Escalation with Overdose Control in Cancer Clinical Trials

Abstract: Escalation With Overdose Control (EWOC) is a Bayesian adaptive dose finding design that produces consistent sequences of doses while controlling the probability that patients are overdosed. EWOC was the first dose-finding procedure to directly incorporate the ethical constraint of minimizing the chance of treating patients at unacceptably high doses. Its defining property is that the expected proportion of patients treated at doses above the maximum tolerated dose (MTD) is equal to a specified value α, the feasibility bound. Topics to be discussed include: the two-parameter logistic model, the use of covariates in prospective clinical trials, drug combinations, and Web-EWOC, a free interactive web tool for designing and conducting dose finding trials in cancer (https://biostatistics.csmc.edu/ewoc/ewocWeb.php).
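
The feasibility-bound idea can be sketched with a toy grid posterior under a two-parameter logistic model (all numbers, grids, and data below are hypothetical illustrations, not Web-EWOC or a validated trial design):

```python
import numpy as np

# Two-parameter logistic dose-toxicity model:
# P(toxicity | dose x) = 1 / (1 + exp(-(b0 + b1 * x)))
def p_tox(x, b0, b1):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

theta = 0.33  # target toxicity rate defining the MTD
alpha = 0.25  # feasibility bound: allowed probability of overdosing

# Hypothetical observed (dose, toxicity) pairs so far
doses = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
tox = np.array([0, 0, 0, 1, 0])

# Grid posterior over (b0, b1) under a flat prior on the grid
b0_grid = np.linspace(-6.0, 0.0, 80)
b1_grid = np.linspace(0.5, 12.0, 80)
B0, B1 = np.meshgrid(b0_grid, b1_grid)
p = p_tox(doses[:, None, None], B0, B1)
lik = np.prod(np.where(tox[:, None, None] == 1, p, 1 - p), axis=0)
post = lik / lik.sum()

# MTD implied by each (b0, b1): the dose where P(toxicity) = theta
mtd = (np.log(theta / (1 - theta)) - B0) / B1

# EWOC rule: the next dose is the alpha-quantile of the MTD posterior,
# so P(next dose > MTD) is controlled at the feasibility bound alpha
order = np.argsort(mtd.ravel())
cdf = np.cumsum(post.ravel()[order])

def mtd_quantile(q):
    return mtd.ravel()[order][np.searchsorted(cdf, q)]

next_dose = mtd_quantile(alpha)
```

Because α < 0.5, the recommended dose sits below the posterior median of the MTD, which is the sense in which EWOC is conservative about overdosing.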

**Friday, November 18: Dr. Mike Daniels, University of Texas at Austin**

214 Duxbury Hall, 10:00am

Title: To be announced

Abstract: To be announced

**Friday, December 2: Dr. Martin Lindquist, Johns Hopkins University**

214 Duxbury Hall, 10:00am

Title: High-dimensional Multivariate Mediation with Application to Neuroimaging Data

Abstract: Mediation analysis is an important tool in the behavioral sciences for investigating the role of intermediate variables that lie in the path between a randomized treatment/exposure and an outcome variable. The influence of the intermediate variable on the outcome is often explored using structural equation models (SEMs), with model coefficients interpreted as possible effects. While there has been significant research on the topic in recent years, little work has been done on mediation analysis when the intermediate variable (mediator) is a high-dimensional vector. In this work, we introduce a novel method for mediation analysis in this setting called the directions of mediation (DMs). The DMs represent an orthogonal transformation of the space spanned by the set of mediators, chosen so that the transformed mediators are ranked based upon the proportion of the likelihood of the full SEM that they explain. We provide an estimation algorithm and establish the asymptotic properties of the obtained estimators. We demonstrate the method using a functional magnetic resonance imaging (fMRI) study of thermal pain where we are interested in determining which brain locations mediate the relationship between the application of a thermal stimulus and self-reported pain.
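
The classical SEM mediation decomposition with a single scalar mediator can be sketched as follows (a toy simulation with hypothetical path coefficients; the directions of mediation generalize this to a high-dimensional mediator):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)                        # randomized treatment/exposure
m = 0.8 * x + rng.normal(size=n)              # mediator equation, path a = 0.8
y = 0.5 * m + 0.3 * x + rng.normal(size=n)    # outcome equation, path b = 0.5

def ols(design, target):
    # Least-squares coefficients, standing in for fitting the SEM equations
    return np.linalg.lstsq(design, target, rcond=None)[0]

ones = np.ones(n)
a_hat = ols(np.column_stack([ones, x]), m)[1]     # treatment -> mediator
b_hat = ols(np.column_stack([ones, x, m]), y)[2]  # mediator -> outcome

# Mediated (indirect) effect a*b; true value here is 0.8 * 0.5 = 0.4
indirect = a_hat * b_hat
```

With a high-dimensional mediator the product `a_hat * b_hat` becomes ill-defined without further structure, which is the gap the directions of mediation are designed to fill.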