Spring 2022 Colloquia

Friday, April 15th: Lan Wang (University of Miami)

11:00 a.m. via Zoom

Title: Transformation-Invariant Learning of Optimal Individualized Decision Rules with Time-to-Event Outcomes

Abstract:

In many important applications of precision medicine, the outcome of interest is time to an event (e.g., death, relapse of disease) and the primary goal is to identify the optimal individualized decision rule (IDR) to prolong survival time. Existing work in this area has mostly focused on estimating the optimal IDR to maximize the restricted mean survival time in the population. We propose a new robust framework for estimating an optimal static or dynamic IDR with time-to-event outcomes based on an easy-to-interpret quantile criterion. The new method does not need to specify an outcome regression model and is robust to heavy-tailed distributions. The estimation problem corresponds to a nonregular M-estimation problem with both finite and infinite-dimensional nuisance parameters. Employing advanced empirical process techniques, we establish the statistical theory of the estimated parameter indexing the optimal IDR. Furthermore, we prove a novel result that the proposed approach can consistently estimate the optimal value function under mild conditions even when the optimal IDR is non-unique, which happens in the challenging setting of exceptional laws. We also propose a smoothed resampling procedure for inference. The proposed methods are implemented in the R package QTOCen. We demonstrate the performance of the proposed new methods via extensive Monte Carlo studies and a real data application.

 

Friday, April 8th: Michael Kosorok (University of North Carolina at Chapel Hill)

11:00 a.m. via Zoom

Title: Nonparametric Reinforcement Learning for Survival Outcomes

Abstract: In some disease settings, such as cancer, there are several stages in the course of treatment where decisions on what treatment to select are made. A key goal of precision medicine in these settings is to determine optimal treatment based on patient status and history. In this presentation, we discuss a flexible new approach to making these determinations based on observed data, with the goal of maximizing a right-censored event time such as overall survival. Our method combines nonparametric random survival forests with reinforcement learning. We show that this algorithm is able to consistently estimate the optimal dynamic treatment regime with reasonable assumptions and performs better than alternative approaches. In addition to theoretical results, we also provide simulation studies and illustrate with an analysis of data from a myeloid leukemia trial involving two stages of treatment, where the first stage is randomized and the second stage is not.

 

Friday, March 25th: Krista Gile (University of Massachusetts Amherst)

11:00 a.m. via Zoom

Title: Inference from Multivariate Respondent-Driven Sampling Data

Abstract: Respondent-Driven Sampling is a type of link-tracing network sampling used to study hard-to-reach human populations. Beginning with a convenience sample, each person sampled is given 2-3 uniquely identified coupons to distribute to other members of the target population, making them eligible for enrollment in the study. This is effective at collecting large diverse samples from many populations.

Due to the complexity of the sampling process, inference for the most fundamental of population features, the population proportion, is challenging and has been the subject of much work in recent years, typically using only data on local network size and the variable of interest.

This talk focuses on inference using multiple variables measured on participants.  We describe using data on local network composition for a variable biasing recruitment to adjust for preferential recruitment, semi-parametric testing for bivariate associations in the RDS dataset, and methods for clustering RDS participants based on covariate and referral data.

 

Friday, March 11th: Ting Ye (University of Washington)

11:00 a.m. via Zoom

Title: Robust Mendelian Randomization in the Presence of Many Weak Instruments and Widespread Horizontal Pleiotropy

Abstract: Mendelian randomization (MR) has become a popular approach to study causal effects by using genetic variants as instrumental variables. We propose a new MR method, GENIUS-MAWII, which simultaneously addresses the two salient phenomena that adversely affect MR analyses: many weak instruments and widespread horizontal pleiotropy. Similar to MR GENIUS (Tchetgen Tchetgen et al., 2021), we achieve identification of the treatment effect by leveraging heteroscedasticity of the exposure. We then derive the class of influence functions of the treatment effect, based on which we construct a continuous updating estimator and establish its consistency and asymptotic normality under a many weak invalid instruments asymptotic regime by developing novel semiparametric theory. We also provide a measure of weak identification and a graphical diagnostic tool. We demonstrate in simulations that GENIUS-MAWII has clear advantages in the presence of directional or correlated horizontal pleiotropy compared to other methods. We apply our method to study the effect of body mass index on systolic blood pressure using UK Biobank.

 

Friday, March 4th: Hongyu Miao (Florida State University)

11:00 a.m. via Zoom

Title: Non-Euclidean Statistics for Novel Digital Biomarker Identification

Abstract: The identification, verification and application of digital biomarkers are of significant scientific interest and importance in many biomedical and healthcare disciplines. However, there still exist numerous challenges in developing more efficient and accurate statistical and data science methodologies for digital biomarker analytics. The focus of this study is thus to develop novel statistical approaches to fill the methodological gap, especially for network data derived from brain imaging. Network data contain numeric, topological, and geometrical information, and must therefore be considered on a manifold for appropriate machine learning and statistical analysis. In this study, a novel framework is presented for two-sample comparison of networks. Specifically, an approximation distance metric to quotient Euclidean distance is proposed, and then combined with network spectral distance to quantify the local and global dissimilarity of networks simultaneously. A permutational non-Euclidean analysis of variance is adapted to the proposed distance metric for the comparison of two independent groups of networks. Comprehensive simulation studies and real applications (e.g., ADHD, ADRD) are conducted to demonstrate the superior performance of our method over other alternatives. The asymptotic properties of the proposed test are investigated, and its high-dimensional extension is discussed as well.

 

Wednesday, January 26th: Michael Law (University of Michigan)

10:00 a.m. via Zoom

 

Monday, January 24th: Abhishek Roy (University of California, Davis)

10:00 a.m. via Zoom

Title: Sequential Decision Making: Nonconvexity and Nonstationarity

Abstract: Numerous statistical problems, including dynamic matrix sensing and completion and online reinforcement learning, can be formulated as nonconvex optimization problems where the objective function changes over time. In this work, we propose and analyze stochastic zeroth-order optimization algorithms in an online learning setting for nonconvex functions in a nonstationary environment. We propose nonstationary versions of regret measures based on first-order and second-order optimal solutions and establish sub-linear regret bounds on these proposed regret measures. The main take-away from this work is that one can track statistically favorable solutions, i.e., stationary points or local minima of the underlying nonconvex objective function of a statistical learning problem, even in a nonstationary environment. For the case of first-order optimal solution-based regret measures, we provide regret bounds for the stochastic gradient descent algorithm. For the case of second-order optimal solution-based regret, we analyze the stochastic cubic-regularized Newton's method. We establish the regret bounds in the zeroth-order oracle setting where one has access to noisy evaluations of the objective function only. We illustrate our results through simulation as well as several learning problems.

 

Friday, January 21st: Weijing Tang (University of Michigan)

10:00 a.m. via Zoom

Title: Survival Analysis via Ordinary Differential Equations

Abstract: Survival analysis is an extensively studied branch of statistics with wide applications in various fields. Despite rich literature on survival analysis, the growing scale and complexity of modern data create new challenges that existing statistical models and estimation methods cannot meet. In the first part of this talk, I will introduce a novel and unified ordinary differential equation (ODE) framework for survival analysis. I will show that this ODE framework allows flexible modeling and enables a computationally and statistically efficient procedure for estimation and inference. In particular, the proposed estimation procedure is scalable, easy-to-implement, and applicable to a wide range of survival models. In the second part, I will present how the proposed ODE framework can be used to address the intrinsic optimization challenge in deep learning survival analysis, so as to accommodate data in diverse formats.

 

Wednesday, January 19th: Lu Zhang (Columbia University)

10:00 a.m. via Zoom

Title: Spatial Factor Modeling: A Bayesian Matrix-Normal Approach for Massive Spatial Data with Missing Observations

Abstract: Multivariate spatially-oriented data sets are prevalent in the environmental and physical sciences. Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture any underlying spatial association for each variable and associations among the different dependent variables. Multivariate latent spatial process models have proved effective in driving statistical inference and rendering better predictive inference at arbitrary locations for the spatial process. High-dimensional multivariate spatial data, which is the theme of this talk, refers to data sets where the number of spatial locations and the number of spatially dependent variables are very large. The field has witnessed substantial developments in scalable models for univariate spatial processes, but such methods for multivariate spatial processes, especially when the number of outcomes is moderately large, are limited in comparison. In this work, we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesian inference, which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the Matrix-Normal distribution, which we use to construct scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high-dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and an analysis of a massive vegetation index data set.

 

Tuesday, January 18th: Toryn Schafer (Cornell University)

10:00 a.m. via Zoom

Title: Bayesian Inverse Reinforcement Learning for Collective Animal Movement

Abstract: The estimation of the spatio-temporal dynamics of animal behavior processes is complicated by nonlinear interactions among individuals and with the environment. Agent-based methods allow for defining simple rules that generate complex group behaviors, but are statistically challenging to estimate and assume behavioral rules are known a priori. Instead of making simplifying assumptions across all anticipated scenarios, inverse reinforcement learning provides inference on the short-term (local) rules governing long-term behavior policies or choices by using properties of a Markov decision process. We use the computationally efficient linearly-solvable Markov decision process (LMDP) to learn the local rules governing collective movement. The estimation of the immediate and long-term behavioral decision costs is done in a Bayesian framework. Basis function smoothing is used to induce smoothness in the costs across the state space. We demonstrate the advantage of the LMDP for estimating dynamics for a classic collective movement agent-based model, the self-propelled particle model. Then, we present the first data application of IRL using the introduced methodology for collective movement of guppies in a tank and estimate trade-offs between social and navigational decisions. Lastly, a brief discussion on the connections to traditional resource selection functions in ecology demonstrates the future potential advantage of LMDPs for inference on behavioral decisions as a result of an accumulation of behavioral costs.

 

Friday, January 14th: Michael Jauch (Cornell University)

10:00 a.m. via Zoom

Title: Mixture Representations for Likelihood Ratio Ordered Distributions

Abstract: In many statistical applications, subject matter knowledge or theoretical considerations suggest that two distributions should satisfy a stochastic order, with samples from one distribution tending to be larger than those from the other. In these situations, incorporating stochastic order constraints can lead to improved inferences. This talk will introduce mixture representations for distributions satisfying a likelihood ratio order. To illustrate the practical value of the mixture representations, I’ll address the problem of density estimation for likelihood ratio ordered distributions. In particular, I'll propose a nonparametric Bayesian solution which takes advantage of the mixture representations. The prior distribution is constructed from Dirichlet process mixtures and has large support on the space of pairs of densities satisfying the monotone ratio constraint. With a simple modification to the prior distribution, we can also test the equality of two distributions against the alternative of likelihood ratio ordering. I’ll demonstrate the approach in two biomedical applications.

 

Wednesday, January 12th: Rounak Dey (Harvard T.H. Chan School of Public Health)

10:00 a.m. via Zoom

Title: An Efficient and Accurate Frailty Model Approach for Genome-Wide Survival Association Analysis Controlling for Genetic Ancestry Structure and Relatedness in Large-Scale Biobank

Abstract: With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression, and lifespan. Apart from the obvious computational challenge that such analyses entail, statistical methods also need to adjust for unknown genetic ancestry structures and familial relatedness among the biobank participants. Further, due to the cohort-based recruitment strategy typically followed in biobanks, most phenotypes are heavily censored, which can lead to extreme type I error inflation in standard asymptotic tests of no genetic effects. We developed an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost, and the saddlepoint approximation to allow for the analysis of heavily censored phenotypes (>90%) and low-frequency genetic variants (down to minor allele count 20). We demonstrated the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association (PheWAS) results with the PheWeb browser.

 

Monday, January 10th: Yubai Yuan (University of California, Irvine)

10:00 a.m. via Zoom

Title: High-order Joint Embedding for Multi-Level Link Prediction

Abstract: Link prediction infers potential links from observed networks, and is one of the essential problems in network analyses. In contrast to traditional graph representation modeling, which only predicts two-way pairwise relations, we propose a novel tensor-based joint network embedding approach that simultaneously encodes pairwise links and hyperlinks onto a latent space, which captures the dependency between pairwise and multi-way links in inferring potential unobserved hyperlinks. The major advantage of the proposed embedding procedure is that it incorporates both the pairwise relationships and subgroup-wise structure among nodes to capture richer network information. In addition, the proposed method introduces a hierarchical dependency among links to infer potential hyperlinks, and leads to better link prediction. In theory, we establish the estimation consistency for the proposed embedding approach, and provide a faster convergence rate compared to link prediction utilizing pairwise links or hyperlinks only. Numerical studies on both simulation settings and Facebook ego-networks indicate that the proposed method improves both hyperlink and pairwise link prediction accuracy compared to existing link prediction algorithms.

This is joint work with Prof. Annie Qu at UC Irvine.

 

Friday, January 7th: Likun Zhang (Lawrence Berkeley National Laboratory)

10:00 a.m. via Zoom

Title: Modeling Extremal Dependence in Trend Analysis of in Situ Measurements of Daily Precipitation Extremes

Abstract: The detection of changes over time in the distribution of precipitation extremes is significantly complicated by noise at the spatial scale of daily weather systems. This so-called "storm dependence" is non-negligible for extreme precipitation and makes detecting changes over time very difficult. To appropriately separate spatial signals from spatial noise due to storm dependence, we first utilize a well-developed Gaussian scale mixture model that directly incorporates extremal dependence. Our method uses a data-driven approach to determine the dependence strength of the observed process (either asymptotic independence or dependence) and is generalized to analyze changes over time and increase the scalability of computations. We apply the model to daily measurements of precipitation over the central United States and compare our results with single-station and conditional independence methods. Our main finding is that properly accounting for storm dependence leads to increased detection of statistically significant trends in the climatology of extreme daily precipitation. Next, in order to extend our analysis to much larger spatial domains, we propose a mixture component model that achieves flexible dependence properties and allows truly high-dimensional inference for extremes of spatial processes. We modify the popular random scale construction by adding non-stationarity to the Gaussian process while allowing the radial variable to vary smoothly across space. As the level of extremeness increases, this single model exhibits both long-range asymptotic independence and short-range weakening dependence strength that leads to either asymptotic dependence or independence. To make inference on the model parameters, we construct global Bayesian hierarchical models and run adaptive Metropolis algorithms concurrently via parallelization. For future work to allow efficient computation, we plan to explore local likelihood and dimension reduction approaches.

 

Thursday, January 6th: Joshua Loyal (University of Illinois Urbana-Champaign)

10:00 a.m. via Zoom

Title: An Eigenmodel for Dynamic Multilayer Networks

Abstract: Network (or graph) data is at the heart of many modern data science problems: disease transmission, community dynamics on social media, international relations, and others. In this talk, I will elaborate on my research in statistical inference for complex time-varying networks. I will focus on dynamic multilayer networks, which frequently represent the structure of multiple co-evolving relations. Despite their prevalence, statistical models are not well-developed for this network type. Here, I propose a new latent space model for dynamic multilayer networks. The key feature of this model is its ability to identify common time-varying structures shared by all layers while also accounting for layer-wise variation and degree heterogeneity. I establish the identifiability of the model's parameters and develop a structured mean-field variational inference approach to estimate the model's posterior, which scales to networks previously intractable to dynamic latent space models. I apply the model to two real-world problems: discerning regional conflicts in a data set of international relations and quantifying infectious disease spread throughout a school based on the students' daily contact patterns.

 

Previous Colloquia

Fall 2021 Colloquia

Spring 2021 Colloquia

Fall 2020 Colloquia

Spring 2020 Colloquia

Fall 2019 Colloquia

Spring 2019 Colloquia

Fall 2018 Colloquia

Spring 2018 Colloquia

Fall 2017 Colloquia

Spring 2016 Colloquia Part II

Fall 2016 Colloquia

Spring 2016 Colloquia