Spring 2021 Colloquia

Friday, April 16th: Qing Lu (Biostatistics, University of Florida)

Title: A Kernel-Based Neural Network for High-dimensional Risk Prediction on Massive Genetic Data

Abstract: Artificial intelligence (AI) is a thriving research field with many successful applications in areas such as computer vision and speech recognition. Neural-network-based methods (e.g., deep learning) play a central role in modern AI technology. While neural-network-based methods also hold great promise for genetic research, the high dimensionality of genetic data, the massive numbers of study samples, and the complex relationships between genetic variants and disease outcomes bring tremendous analytic and computational challenges. To address these challenges, we propose a kernel-based neural network (KNN) method. KNN inherits features from both linear mixed models (LMM) and classical neural networks and is designed for high-dimensional genetic data analysis. Unlike the classic neural network, KNN summarizes a large number of genetic variants into kernel matrices and uses the kernel matrices as input matrices. Based on the kernel matrices, KNN builds a feedforward neural network to model the complex relationship between genetic variants and a disease outcome. Minimum norm quadratic unbiased estimation and batch training are implemented in KNN to accelerate the computation, making KNN applicable to massive datasets with millions of samples. Through simulations, we demonstrate the advantages of KNN over LMM in terms of prediction accuracy and computational efficiency. We also apply KNN to the large-scale UK Biobank dataset, evaluating the role of a large number of genetic variants in multiple complex diseases.

Friday, April 9th: Wenbin Lu (Statistics, NCSU)

Title: Jump Q-Learning for Optimal Interval-Valued Treatment Decision Rule

Abstract: An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this work, we focus on the continuous treatment setting and propose a jump Q-learning method to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump Q-learning method estimates the conditional mean of the response given the treatment and the covariates (the Q-function) via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated Q-function. The regressor is allowed to be either linear for clear interpretation or a deep neural network to model complex treatment-covariate interactions. To implement jump Q-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the Q-function. Statistical properties of the resulting I2DR are established when the Q-function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the estimated optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.

Friday, April 2nd: Li Hsu (Public Health Sciences Division, Fred Hutchinson Cancer Research Center)

Title: Summary Statistics-based Methods for Genetic Association Studies: Discovery and Translation

Abstract: In recent decades, genome-wide association studies (GWAS) have been the backbone for studying the genetic etiology of complex diseases. Tens of thousands of genetic variants have been identified as associated with various phenotypes, but together they explain only a fraction of heritability. As the effect sizes of genetic variants are generally modest and the number of variants is in the millions, GWAS need large sample sizes in order to have adequate power to detect these variants. In order to facilitate pooling GWAS data, it is common to share the summary statistics of marginal associations for meta-analysis, instead of individual-level data. It is thus of great interest to make use of these summary statistics together with a smaller set of individual-level data to perform the data analysis. In this presentation, I will present a summary statistics-based approach for joint association testing of a set of functionally informed variants, recovering the information that would have been obtained if one were to have individual-level data. If time permits, I will discuss another topic on risk prediction of chronic diseases and recalibration of risk prediction models for the target population by leveraging the summary statistics from the target population.

Friday, March 19th: Chiung-Yu Huang (Biostatistics, UCSF)

Title: Censored Linear Regression in the Presence or Absence of Auxiliary Survival Information

Abstract: There has been a rising interest in better exploiting auxiliary summary information from large databases in the analysis of smaller-scale studies that collect more comprehensive patient-level information. The purpose of this paper is twofold: firstly, we propose a novel approach to synthesize information from both the aggregate and the individual-level data in censored linear regression. We show that the auxiliary information amounts to a system of nonsmooth estimating equations and thus can be combined with the conventional weighted log-rank estimating equations by applying the idea of generalized method of moments (GMM). The proposed methodology can be further extended to account for potential inconsistency in information from different sources. Secondly, in the absence of auxiliary information, we propose to improve estimation efficiency by combining overidentified weighted log-rank estimating equations with different weight functions via the GMM framework. To deal with the nonsmooth GMM-type objective functions, we develop an asymptotics-guided algorithm for parameter and variance estimation. We establish the asymptotic normality of the proposed GMM-type estimators. Simulation studies show that the proposed estimators can yield substantial efficiency gains over the conventional weighted log-rank estimators. The proposed methods are applied to a pancreatic cancer study for illustration.

Friday, March 12th: Zhaoran Wang (Industrial Engineering & Management Sciences, Northwestern University)

Zoom, 11:00am

Title: Demystifying (Deep) Reinforcement Learning with Optimism and Pessimism

Abstract: Coupled with powerful function approximators such as deep neural networks, reinforcement learning (RL) achieves tremendous empirical successes. However, its theoretical understanding lags behind. In particular, it remains unclear how to provably attain the optimal policy with a finite regret or sample complexity. In this talk, we will present two sides of the same coin, which demonstrate an intriguing duality between optimism and pessimism.
- In the online setting, we aim to learn the optimal policy by actively interacting with the environment. To strike a balance between exploration and exploitation, we propose an optimistic least-squares value iteration algorithm, which achieves a \sqrt{T} regret in the presence of linear, kernel, and neural function approximators.
- In the offline setting, we aim to learn the optimal policy based on a dataset collected a priori. Due to a lack of active interactions with the environment, we suffer from insufficient coverage of the dataset. To maximally exploit the dataset, we propose a pessimistic least-squares value iteration algorithm, which achieves a minimax-optimal sample complexity.

Friday, March 5th: Xiang Zhou (Biostatistics, University of Michigan)

Zoom, 11:00am

Title: Statistical Analysis of Spatial Expression Pattern for Spatially Resolved Transcriptomic Studies

Abstract: Identifying genes that display spatial expression patterns in spatially resolved transcriptomic studies is an important first step towards characterizing the spatial transcriptomic landscape of complex tissues. Here, we developed a statistical method, SPARK, for identifying such spatially expressed genes in data generated from various spatially resolved transcriptomic techniques. SPARK directly models spatial count data through generalized linear spatial models. It relies on newly developed statistical formulas for hypothesis testing, providing effective type I error control and yielding high statistical power. With a computationally efficient algorithm based on penalized quasi-likelihood, SPARK is also scalable to data sets with tens of thousands of genes measured on tens of thousands of samples. In four published spatially resolved transcriptomic data sets, we show that SPARK can be up to ten times more powerful than existing methods, revealing new biology in the data that cannot be revealed by existing approaches.

Friday, February 19th: Joseph Ibrahim (Biostatistics, University of North Carolina at Chapel Hill)

Zoom, 11:00am

Title: The Scale Transformed Power Prior for Use with Historical Data from a Different Outcome Model

Abstract: We develop the scale transformed power prior for settings where historical and current data involve different data types, such as binary and continuous data, respectively. This situation arises often in clinical trials, for example, when historical data involve binary responses and the current data involve time-to-event or some other type of continuous or discrete outcome. The power prior proposed by Ibrahim and Chen (2000) does not address the issue of different data types. Herein, we develop a new type of power prior, which we call the scale transformed power prior (straPP). The straPP is constructed by transforming the power prior for the historical data by rescaling the parameter using a function of the Fisher information matrices for the historical and current data models, thereby shifting the scale of the parameter vector from that of the historical to that of the current data. Examples are presented to motivate the need for a scale transformation, and simulation studies are presented to illustrate the performance advantages of the straPP over the power prior and other informative and non-informative priors. A real dataset from a clinical trial undertaken to study a novel transitional care model for stroke survivors is used to illustrate the methodology.
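
For readers unfamiliar with it, the power prior of Ibrahim and Chen (2000) referred to in this abstract is commonly written as below. The notation is assumed here and does not come from the abstract: D_0 denotes the historical data, L(\theta | D_0) its likelihood, \pi_0(\theta) an initial prior, and a_0 a discounting parameter.

```latex
\pi(\theta \mid D_0, a_0) \;\propto\; L(\theta \mid D_0)^{a_0}\, \pi_0(\theta),
\qquad 0 \le a_0 \le 1 .
```

Setting a_0 = 0 ignores the historical data, while a_0 = 1 gives it the same weight as current data. The rescaling step that defines the straPP is not reproduced here, since the abstract does not give its explicit form.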

Friday, February 12th: Anru Zhang (Statistics, University of Wisconsin-Madison)

Zoom, 11:00am

Title: Statistical Learning for High-dimensional Tensor Data

Abstract: The analysis of tensor data has recently become an active research topic in statistics and data science. Many high-order datasets arising from a wide range of modern applications, such as genomics, material science, and neuroimaging analysis, require modeling with high-dimensional tensors. In addition, tensor methods provide unique perspectives and solutions to many high-dimensional problems where the observations are not necessarily tensors. High-dimensional tensor problems generally possess distinct characteristics that pose unprecedented challenges to the statistical community. There is a clear need to develop novel methods, algorithms, and theory to analyze high-dimensional tensor data.

In this talk, we discuss some recent advances in high-dimensional tensor data analysis through several fundamental topics and their applications in microscopy imaging and neuroimaging. We will also illustrate how we develop new statistically optimal methods, computationally efficient algorithms, and fundamental theories that exploit information from high-dimensional tensor data based on the modern theory of computation, non-convex optimization, applied linear algebra, and high-dimensional statistics.

Friday, February 5th: Xianyang Zhang (Statistics, Texas A&M University)

Zoom, 11:00am

Title: 2dFDR: A Two-Dimensional False Discovery Rate Control for Powerful Confounder Adjustment in Omics Association Studies

Abstract: One problem that plagues omics association studies is the loss of statistical power when adjusting for confounders and multiple testing. While there is a vast literature on multiple testing methodologies, methods that simultaneously take into account confounders and multiple testing are lacking. To fill this methodological gap, we develop a linear model-based two-dimensional false discovery rate control procedure (2dFDR) for powerful confounder adjustment under multiple testing. Through extensive simulation studies and a large-scale evaluation on real data, we demonstrate that 2dFDR is substantially more powerful than the traditional procedure while controlling for false positives. In the presence of strong confounding and weak signals, power improvement could be more than 100%.

Friday, January 22nd: Joseph Cappelleri (Executive Director of Biostatistics, Pfizer)

Zoom, 11:00am

Title: Cultivating a Career as a Statistical Collaborator in the Pharmaceutical Industry

Abstract: A major element of professional success is to cultivate an environment of collaboration within one’s work and organization. The successful 21st-century statistician needs to develop and refine first-rate quantitative skills through dedication, habitual study, and regular practice. This presentation provides a historical perspective on the eclectic role and responsibilities of statisticians specifically in the pharmaceutical industry. Recommendations are given on how statisticians there can be effective collaborators. Topics covered include the importance of finding a mentor, being open to and aware of professional opportunities, and developing a tolerance for change.

Friday, January 15th: Eric Lock (Biostatistics, University of Minnesota)

Zoom, 11:00am

Title: Bidimensional Linked Matrix Decomposition for Pan-Omics Pan-Cancer Analysis

Abstract: Several recent methods address the integrative dimension reduction and decomposition of linked high‐content data matrices. Typically, these methods consider one dimension, rows or columns, that is shared among the matrices. This shared dimension may represent common features measured for different sample sets (horizontal integration) or a common sample set with features from different platforms (vertical integration). This is limiting for data that take the form of bidimensionally linked matrices, e.g., multiple molecular omics platforms measured for multiple sample cohorts, which are increasingly common in biomedical studies. We propose a flexible approach to the simultaneous factorization and decomposition of variation across bidimensionally linked matrices, BIDIFAC+. This decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., sample cohorts). Our objective function extends nuclear norm penalization, is motivated by random matrix theory, and can be shown to give the mode of a Bayesian posterior distribution. We apply the method to pan-omics pan-cancer data from The Cancer Genome Atlas (TCGA), integrating data from 4 different omics platforms and 29 different cancer types.
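
As background for the nuclear norm penalization mentioned in this abstract, here is a minimal sketch of the basic single-matrix case, which is solved by soft-thresholding the singular values. It is not BIDIFAC+ itself; the function name, the penalty value, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def nuclear_norm_denoise(X, lam):
    """Minimize 0.5 * ||X - A||_F^2 + lam * ||A||_* over A.

    The minimizer is obtained by soft-thresholding the singular
    values of X at level lam (hypothetical helper, for illustration).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # shrink singular values, zero out small ones
    return (U * s_shrunk) @ Vt           # rebuild the low-rank estimate

# Simulated example (assumed data): a rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
X = signal + 0.1 * rng.normal(size=(30, 20))

A = nuclear_norm_denoise(X, lam=2.0)
print("rank of X:", np.linalg.matrix_rank(X))
print("rank of estimate:", np.linalg.matrix_rank(A))
```

A larger penalty removes more singular values and so yields a lower-rank estimate. The abstract describes an objective that extends this idea to components shared across row sets and column sets.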

Friday, January 8th: James Stephen Marron (Statistics, University of North Carolina at Chapel Hill)

Zoom, 11:00am

Title: Data Integration Via Analysis of Subspaces (DIVAS)

Abstract: A major challenge in the age of Big Data is the integration of disparate data types into a data analysis. That is tackled here in the context of data blocks measured on a common set of experimental cases. This data structure motivates the simultaneous exploration of the joint and individual variation within each data block. DIVAS improves on earlier methods by using a novel random direction approach to statistical inference and by treating partially shared blocks. Usefulness is illustrated using mortality, cancer, and neuroimaging data sets.

Previous Colloquia

Fall 2020 Colloquia

Spring 2020 Colloquia

Fall 2019 Colloquia

Spring 2019 Colloquia

Fall 2018 Colloquia

Spring 2018 Colloquia

Fall 2017 Colloquia

Spring 2016 Colloquia Part II

Fall 2016 Colloquia

Spring 2016 Colloquia