Spring 2019 Colloquia
Upcoming Colloquia: No Additional Colloquia Scheduled for Spring 2019
Tuesday, January 8th: Dr. Chengchun Shi, North Carolina State University
204 Duxbury Hall, 2:00pm
Title: On Statistical Learning for Individualized Decision Making with Complex Data
Abstract: In this talk, I will present my research on individualized decision making with modern complex data. In precision medicine, individualizing the treatment decision rule can capture patients' heterogeneous response towards treatment. In finance, individualizing the investment decision rule can improve an individual's financial well-being. In a ride-sharing company, individualizing the order dispatching strategy can increase its revenue and customer satisfaction. With the fast development of new technology, modern datasets often consist of massive numbers of observations and high-dimensional covariates, and are characterized by some degree of heterogeneity.
The talk is divided into two parts. In the first part, I will focus on the data heterogeneity and introduce a new maximin-projection learning for recommending an overall individualized decision rule based on the observed data from different populations with heterogeneity in optimal individualized decision making. In the second part, I will briefly summarize the statistical learning methods I've developed for individualized decision making with complex data and discuss my future research directions.
Friday, January 11th: Dr. Ryan Sun, Harvard University
214 Duxbury Hall, 10:00am
Title: Set-based Inference for Integrative Analysis of Genetic Compendiums
Abstract: The increasing popularity of biobanks and other genetic compendiums has introduced exciting opportunities to extract knowledge using datasets combining information from a variety of genetic, genomic, environmental, and clinical sources. To manage the large number of association tests that may be performed with such data, set-based inference strategies have emerged as a popular alternative to testing individual features. Set-based tests enjoy natural advantages including a decreased multiplicity burden and superior interpretations in certain settings. However, existing methods are often challenged to provide adequate power due to three issues in particular: sparse signals, weak effect sizes, and features exhibiting a diverse variety of correlation structures. Motivated by these challenges, we propose the Generalized Berk-Jones (GBJ) statistic, a set-based association test designed to detect rare and weak signals while explicitly accounting for arbitrary correlation patterns. Consistent with its formulation as a generalization of the Berk-Jones statistic, GBJ demonstrates improved power compared to other set-based tests over a variety of moderately sparse settings. We apply GBJ to perform inference on sets of genotypes and sets of phenotypes, and we also discuss strategies for situations where the global null is not the null hypothesis of interest.
Tuesday, January 15th: Yuefeng Han, University of Chicago
214 Duxbury Hall, 2:00pm
Title: Learning High Dimensional Time Series Data
Abstract: High-dimensional temporal dependent data arise in a wide range of disciplines. Despite its widespread applicability, however, methods and theoretical tools to analyze such data remain poorly investigated. My talk will mainly focus on three problems. The first part aims at prediction for high dimensional linear processes. Then I will introduce a new framework for high dimensional non-parametric additive Vector Autoregressive (VAR) models. Methodology and computationally efficient algorithms are developed under this new framework. Finally, I will present theoretical tools, optimal Bernstein-type inequalities for suprema of empirical processes with dependent data, equipped with which we can also establish a statistical learning theory for dependent data.
Friday, January 18th: Dr. Joshua Cape, Johns Hopkins University
214 Duxbury Hall, 10:00am
Title: Statistical Analysis and Spectral Methods for Signal-Plus-Noise Matrix Models
Abstract: Estimating eigenvectors and principal subspaces is of fundamental importance for numerous problems in statistics, data science, and network analysis, including covariance matrix estimation, principal component analysis, and community detection. For each of these problems, we obtain foundational results that precisely quantify the local (e.g., entrywise) behavior of sample eigenvectors within the context of a unified signal-plus-noise matrix framework. Our methods and results collectively address eigenvector consistency and asymptotic normality, decompositions of high-dimensional matrices, Procrustes analysis, deterministic perturbation bounds, and real-data spectral clustering applications in connectomics.
Friday, January 25th: Dr. Zhengling Qi, University of North Carolina
214 Duxbury Hall, 10:00am
Title: Learning Optimal Individualized Decision Rules with Risk Control
Abstract: With the emergence of precision medicine, estimation of optimal individualized decision rules (IDRs) has attracted tremendous attention in many scientific areas. Most existing literature has focused on finding optimal IDRs that can maximize the expected outcome for each individual. Motivated by complex individualized decision making procedures and the popular conditional value at risk, in this talk, I will introduce two new robust criteria to evaluate IDRs: one is focused on the average lower tail of the subjects’ outcomes and the other is on the individualized lower tail of each subject’s outcome. The proposed criteria take tail behaviors of the outcome into consideration, and thus the resulting optimal IDRs are robust in controlling adverse events. The optimal IDRs under our criteria can be interpreted as the distributionally robust decision rules that maximize the “worst-case” scenario of the outcome within a probability constrained set. Simulation studies and a real data application are used to demonstrate the robust performance of our methods. Finally, I will introduce a more general decision-rule based optimized covariates dependent equivalent framework for individualized decision making with risk control.
Tuesday, January 29th: Dr. Fei Gao, University of Washington
214 Duxbury Hall, 2:00pm
Title: Non-iterative Estimation Update for Parametric and Semiparametric Models with Population-based Auxiliary Information
Abstract: With the advancement in disease registries and surveillance data, population-based information on disease incidence, survival probability or other important biological characteristics has become increasingly available. Such information can be leveraged in studies that collect detailed measurements but with smaller sample sizes. In contrast to recent proposals that formulate the additional information as constraints in optimization problems, we develop a general framework to construct simple estimators that update the usual regression estimators with some functionals of data that incorporate the additional information. We consider general settings which include nuisance parameters in the auxiliary information, non-i.i.d. data such as case-control sampling, and semiparametric models with infinite-dimensional parameters. Detailed examples of several important data and sampling settings are provided.
Friday, February 1st: Dr. Chao Huang, University of North Carolina at Chapel Hill
214 Duxbury Hall, 10:00am
Title: Surrogate Variable Analysis for Multivariate Functional Responses in Imaging Data
Abstract: With the rapid growth of modern technology, many large-scale biomedical studies, e.g., the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, have been conducted to collect massive datasets with large volumes of complex information from increasingly large cohorts. Despite the numerous successes of biomedical studies, imaging heterogeneity has posed many challenges in both data integration and disease etiology. Specifically, imaging heterogeneity often presents at three different levels: subject level, group level, and study level. This talk mainly focuses on the heterogeneity at the study level. The study-level heterogeneity can result from differences in study environment, population, design, and protocols, which are mostly unknown. Surrogate variable analysis (SVA), which is a powerful tool in tackling this heterogeneity, has been widely used in genomic studies. However, imaging data is usually represented as a functional phenotype, while no existing SVA procedures work for functional responses. To address these challenges, a functional latent factor regression model (FLFRM) is proposed to handle the unknown factors. Several inference procedures are established for estimating the unknown parameters and detecting the latent factors. The consistency of the estimates of the latent variables and the weak convergence of the estimates of the parameters are systematically investigated. The finite-sample performance of the proposed procedures is assessed by Monte Carlo simulations and a real data example on hippocampal surface data from the ADNI study.
Tuesday, February 5th: Dr. Andres Felipe Barrientos, Duke University
214 Duxbury Hall, 2:00pm
Title: Bayesian nonparametric models for compositional data
Abstract: We propose Bayesian nonparametric procedures for density estimation for compositional data, i.e., data in the simplex space. To this aim, we propose prior distributions on probability measures based on modified classes of multivariate Bernstein polynomials. The resulting prior distributions are induced by mixtures of Dirichlet distributions, with random weights and a random number of components. Theoretical properties of the proposal are discussed, including large support and consistency of the posterior distribution. We use the proposed procedures to define latent models and apply them to data on employees of the U.S. federal government. Specifically, we model data comprising federal employees’ careers, i.e., the sequence of agencies where the employees have worked. Our modeling of the employees’ careers is part of a broader undertaking to create a synthetic dataset of the federal workforce. The synthetic dataset will facilitate access to relevant data for social science research while protecting subjects’ confidential information.
Friday, February 8th: Dr. Jingshu Wang, University of Pennsylvania
214 Duxbury Hall, 10:00am
Title: Data Denoising for Single-cell RNA sequencing
Abstract: Single-cell RNA sequencing (scRNA-seq) measures gene expression levels in every single cell, which is a ground-breaking technology over microarrays and bulk RNA sequencing and reshapes the field of biology. Though the technology is exciting, scRNA-seq data is very noisy and often too noisy for signal detection and robust analysis. In the talk, I will discuss how we perform data denoising by learning across similar genes and borrowing information from external public datasets to improve the quality of downstream analysis.
Specifically, I will discuss how we set up the model by decomposing the randomness of scRNA-seq data into three components, the structured shared variations across genes, biological “noise” and technical noise, based on current understandings of the stochasticity in DNA transcription. I will emphasize one key challenge in each component and our contributions. I will show how we make proper assumptions on the technical noise and introduce a key feature, transfer learning, in our denoising method SAVER-X. SAVER-X uses a deep autoencoder neural network coupled with Empirical Bayes shrinkage to extract transferable gene expression features across datasets under different settings and learn from external data as prior information. I will show that SAVER-X can successfully transfer information from mouse to human cells and can guard against bias. I'll also briefly discuss our ongoing work on post-denoising inference for scRNA-seq.
Tuesday, February 12th: Dr. Abhishek Chakrabortty, University of Pennsylvania
214 Duxbury Hall, 2:00pm
Title: Semi-Supervised Inference with Large and High Dimensional Data: A Semi-Parametric Perspective
Abstract: The abundance of large and complex datasets in the current big data era has also created a host of novel statistical challenges for properly harnessing such rich (but often incomplete) information. One such challenge includes statistical inference in semi-supervised (SS) settings, where apart from a moderate sized supervised data (L), one also has a much larger sized unsupervised data (U) available. Such datasets arise naturally when the response, unlike the covariates, is difficult and/or expensive to obtain, a frequent scenario in modern studies involving large databases, including biomedical data like electronic health records (EHR). It is natural to investigate whether and how the information from U can be exploited to improve efficiency over a given supervised approach.
In this talk, I will consider SS inference for a class of standard Z-estimation problems. I will discuss first the subtleties and associated challenges that necessitate a semi-parametric perspective. I will then demonstrate a family of SS Z-estimators that are robust and adaptive, thus ensuring that they are always as efficient as the supervised estimator and more efficient (optimal in some cases) when the information from U actually relates to the parameter of interest. These properties are crucial for advocating ‘safe’ use of unlabeled data and are often unaddressed. Our framework provides a much needed unified understanding of these problems. Multiple EHR data applications are also presented to exhibit the practical benefits of our estimator. In the later part of the talk, I consider SS inference in high dimensional settings, and demonstrate the remarkable benefits the unlabeled data provides in seamlessly obtaining a family of SS estimators with asymptotic linear expansions, without directly requiring any sparsity conditions or debiasing needed in supervised settings. This, in particular, facilitates high dimensional inference under minimal assumptions.
Friday, February 15th: Dr. Hai Shu, The University of Texas MD Anderson Cancer Center
214 Duxbury Hall, 10:00am
Title: Extracting Common and Distinctive Signals from High-dimensional Datasets
Abstract: Modern biomedical studies often collect large-scale multi-source/-modal datasets on a common set of objects. A typical approach to the joint analysis of such high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within the single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider a more necessary orthogonal relationship among the distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the L2 space of random variables rather than the conventionally used Euclidean space, with a carefully designed orthogonal relationship among the distinctive matrices. The associated estimators of common and distinctive signal matrices are asymptotically consistent and have reasonably better performance than state-of-the-art methods in both simulated data and the analyses of breast cancer genomic datasets from The Cancer Genome Atlas and motor-task functional MRI data from the Human Connectome Project.
Friday, February 22nd: Dr. Taps Maiti, Michigan State University
214 Duxbury Hall, 10:00am
Title: Statistical Learning with High-dimensional Structurally Dependent Data
Abstract: The rapid development of information technology is making it possible to collect massive amounts of multidimensional, multimodal data with high dimensionality in diverse fields of science and engineering. New statistical learning and data mining methods have been developing accordingly to solve challenging problems arising out of these complex systems. In this talk, we will discuss a specific type of statistical learning, namely the problem of feature selection and classification when the features are high dimensional and structured; specifically, they are spatio-temporal in nature. Various machine learning techniques are suitable for this type of problem, although the underlying statistical theories are not well established. We will discuss linear discriminant analysis based techniques under spatial dependence and their theoretical and numerical properties. The work is motivated by the analysis of brain imaging data.
Friday, March 8th: Dr. Ian Dryden, University of Nottingham
214 Duxbury Hall, 10:00am
Title: Manifold Valued Data Analysis of Samples of Networks
Abstract: Networks can be used to represent many systems such as text documents, social interactions and brain activity, and it is of interest to develop statistical techniques to compare networks. We develop a general framework for extrinsic statistical analysis of samples of networks, motivated by networks representing text documents in corpus linguistics. We identify networks with their graph Laplacian matrices, for which we define metrics, embeddings, tangent spaces, and a projection from Euclidean space to the space of graph Laplacians. This framework provides a way of computing means, performing principal component analysis and regression, and performing hypothesis tests, such as for testing for equality of means between two samples of networks. We apply the methodology to the set of novels by Jane Austen and Charles Dickens.
This is joint work with Katie Severn and Simon Preston.
Friday, March 15th: Dr. Eric Chi, North Carolina State University
214 Duxbury Hall, 10:00am
Title: Getting Your Arrays in Order With Convex Optimization
Abstract: Clustering is a fundamental unsupervised learning technique that aims to discover groups of objects in a dataset. Biclustering extends clustering to two dimensions where both observations and variables are grouped simultaneously, such as clustering both cancerous tumors and genes or both documents and words. We develop and study a convex formulation of the generalization of biclustering to co-clustering the modes of multiway arrays or tensors, the generalization of matrices. Our convex co-clustering (CoCo) estimator is guaranteed to obtain a unique global minimum of the formulation and generates an entire solution path of possible co-clusters governed by a single tuning parameter. We extensively study our method in several simulated settings, and also apply it to an online advertising dataset. We also provide a finite sample bound for the prediction error of our CoCo estimator.
Friday, March 29th: Dr. S. Ejaz Ahmed, Brock University
214 Duxbury Hall, 10:00am
Title: Implicit Bias in Big Data Analytics
Abstract: Nowadays a large amount of data is available, and the need for novel statistical strategies to analyze such data sets is pressing. This talk focusses on the development of statistical and computational strategies for a sparse regression model in the presence of mixed signals. The existing estimation methods have often ignored contributions from weak signals. However, in reality, many predictors altogether provide useful information for prediction, although the amount of such useful information in a single predictor might be modest. The search for such signals, sometimes called networks or pathways, is for instance an important topic for those working on personalized medicine. We discuss a new “post selection shrinkage estimation strategy” that takes into account the joint impact of both strong and weak signals to improve the prediction accuracy, and opens pathways for further research in such scenarios.
214 Duxbury Hall, 10:00am
Dr. Gregory Campbell - Ph.D (1976) FSU, President, GCStat Consulting LLC
Title: Everything You Wanted to Know about Statistical Consulting but Were Afraid to Ask
Abstract: According to John Tukey, “The best thing about being a statistician is that you get to play in everyone’s backyard.” If you are scientifically or statistically curious and would enjoy adapting statistical procedures (or inventing new ones) in a novel environment, then maybe statistical consulting is for you. While vast statistical knowledge is important, so are very good communication skills for a good statistical consultant. Some of the pitfalls of consulting in academia or industry are highlighted and advice is provided to those interested in statistical consultation. This presentation is based on experiences as a graduate student at FSU long ago, an academic at Purdue, a tenured statistical research scientist at NIH, Director of Biostatistics in FDA’s device center and now as an independent consultant.
Dr. Shanti Gomatam - Ph.D (1995) FSU, CDER statistician
Title: Biostatistics at the FDA CDER
Abstract: I work at the US Food and Drug Administration’s (FDA) Center for Drug Evaluation and Research (CDER) in the Biostatistics Division that deals with quantitative safety evaluation. In this presentation I will sketch the general structure of the FDA, CDER, and the Office of Biostatistics within CDER where most CDER statisticians are employed. I will also go through the range of opportunities and work responsibilities CDER statisticians have. Finally, I will discuss opportunities for those not currently employed with the FDA.
Friday, April 19th: Dr. Jianguo "Tony" Sun, University of Missouri
214 Duxbury Hall, 10:30am
Title: Simultaneous Estimation and Variable Selection for Incomplete Event History Data
Abstract: This talk discusses regression analysis of incomplete event history data with the focus on simultaneous estimation and variable selection. Such data commonly occur in many areas such as medical studies and social sciences, and a great deal of literature has been established for their analysis except for the variable selection problem. To address this, we will present a new method, which will be referred to as a broken adaptive ridge regression approach, and establish its asymptotic properties including the oracle property and clustering effect. Numerical studies suggest that the proposed method performs well in practical situations and better than the existing methods. An application will be presented.
Friday, April 26th: Dr. Marianna Pensky, University of Central Florida
214 Duxbury Hall, 10:00am
Title: Estimation and Clustering in Popularity Adjusted Stochastic Block Model
Abstract: Stochastic networks in general and stochastic block models in particular have attracted a lot of attention in the last decade. The talk considers the Popularity Adjusted Block Model (PABM), which generalizes the Stochastic Block Model and the Degree Corrected Block Model by allowing more flexibility for block probabilities. We argue that the main appeal of the PABM is its less rigid spectral structure, which makes the PABM an attractive choice for modeling networks that appear in biological sciences.