Fall 2020 Colloquia
Title: Multiple Mediators: Unraveling Intermediate Traits, or How are Things Connected?
Abstract: Mediation analysis attempts to determine whether the relationship between an independent variable (e.g., exposure) and an outcome variable can be explained, at least partially, by an intermediate variable, called a mediator. Most methods for mediation analysis focus on one mediator at a time, although multiple mediators can be jointly analyzed by structural equation models that account for correlations among the mediators. We extend the use of structural equation models for the analysis of multiple mediators by creating a sparse group lasso penalized model whose penalty respects the natural groupings of the parameters that determine mediation while encouraging sparsity of the model parameters. This provides a way to simultaneously evaluate many mediators and select those that have the most impact, a feature of modern penalized models. Simulations are used to illustrate the benefits and limitations of our approach, and application to a study of response to rubella vaccination illustrates its utility in reducing a large number of potential mediators to a few targets of interest. Our new methods are implemented in the R package regmed.
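To fix ideas, here is a minimal sketch of a sparse group lasso penalty of the generic form lam * [(1 - alpha) * sum_g ||beta_g||_2 + alpha * ||beta||_1], with each mediator's (exposure-to-mediator, mediator-to-outcome) coefficient pair forming a group. The function name, grouping, and numbers are illustrative assumptions, not taken from regmed:

```python
import numpy as np

def sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5):
    """Sparse group lasso penalty: a group-lasso term that can zero out
    an entire group (e.g., both paths of one mediator) plus an l1 term
    that encourages sparsity within groups."""
    beta = np.asarray(beta, dtype=float)
    group_term = sum(np.linalg.norm(beta[idx]) for idx in groups)
    l1_term = np.abs(beta).sum()
    return lam * ((1 - alpha) * group_term + alpha * l1_term)

# Two hypothetical mediators; the second mediator's group is fully zeroed out.
beta = [0.8, 0.5, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
print(sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5))
```

Setting alpha = 0 recovers a pure group lasso (whole mediators in or out); alpha = 1 recovers the ordinary lasso.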
Title: Large-Scale Hypothesis Testing for Causal Mediation Effects with Applications in Genome-wide Epigenetic Studies
Abstract: In genome-wide epigenetic studies, it is of great scientific interest to assess whether the effect of an exposure on a clinical outcome is mediated through DNA methylation. However, statistical inference for causal mediation effects is challenged by the fact that one needs to test a large number of composite null hypotheses across the whole epigenome. Two popular tests, the Wald-type Sobel's test and the joint significance test, are underpowered and can thus miss important scientific discoveries. In this paper, we show that the null distribution of Sobel's test is not the standard normal distribution and that the null distribution of the joint significance test is not uniform under the composite null of no mediation effect, especially in finite samples and at the singular point of the null where the exposure has no effect on the mediator and the mediator has no effect on the outcome. These results explain why the two tests are underpowered and, more importantly, motivate our development of a more powerful Divide-Aggregate Composite-null Test (DACT) for the composite null hypothesis of no mediation effect, which leverages epigenome-wide data and adopts Efron's empirical null framework for assessing statistical significance. Extensive simulation studies show that DACT properly controls the type I error rate and outperforms Sobel's test and the joint significance test for detecting mediation effects. We applied DACT to the Normative Aging Study and identified additional DNA methylation CpG sites that might mediate the effect of smoking on lung function; a comprehensive sensitivity analysis demonstrated that these mediation results are robust to unmeasured confounding. We also developed a computationally efficient R package, DACT, for public use.
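For context, the two classical tests the paper critiques can be sketched as follows; the statistics themselves are standard, but the function names and numerical inputs here are illustrative:

```python
import math

def sobel_z(a, se_a, b, se_b):
    """Sobel's Wald-type statistic for the mediation effect a*b.
    Under the singular null (a = 0 and b = 0) its distribution is not
    standard normal, which is the source of the power loss."""
    return (a * b) / math.sqrt(a**2 * se_b**2 + b**2 * se_a**2)

def joint_significance_p(p_a, p_b):
    """Joint significance (MaxP) test: take the larger of the two
    component p-values; conservative under the composite null."""
    return max(p_a, p_b)

# a: exposure -> mediator effect; b: mediator -> outcome effect
print(round(sobel_z(0.30, 0.10, 0.25, 0.08), 3))
print(joint_significance_p(0.01, 0.04))
```

DACT instead classifies which case of the composite null each CpG site likely falls in, using Efron's empirical null estimated from the epigenome-wide data; that step is not shown here.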
Title: Predicting Disease Risk from Genomics Data
Abstract: Accurate disease risk prediction based on genetic and other factors can lead to more effective disease screening, prevention, and treatment strategies. Despite the identification of thousands of disease-associated genetic variants through genome-wide association studies over the past 15 years, the performance of genetic risk prediction remains moderate or poor for most diseases, largely because of the challenges in both identifying all the functionally relevant variants and accurately estimating their effect sizes. Moreover, as most genetic studies have been conducted in individuals of European ancestry, it is even more challenging to develop accurate prediction models for other populations. Furthermore, many studies provide only summary statistics instead of individual-level genotype and phenotype data. In this presentation, we will discuss a number of statistical methods that have been developed to address these issues by jointly estimating effect sizes (both across genetic markers and across populations), modeling marker dependency, incorporating functional annotations, and leveraging genetic correlations among different diseases. We will demonstrate the utility of these methods through applications to a number of complex diseases/traits in large population cohorts, e.g., the UK Biobank. This is joint work with Wei Jiang, Yiming Hu, Yixuan Ye, Geyu Zhou, Qiongshi Lu, and others.
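As a rough illustration of the basic prediction step that such methods refine, a polygenic score is a weighted sum of allele counts, with per-variant weights derived from (possibly re-estimated) summary statistics. Everything in this sketch is synthetic and hypothetical:

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """Polygenic risk score: weighted sum of minor-allele counts.
    The statistical work lies in choosing good weights, e.g., by
    re-estimating effects jointly across markers and populations."""
    return np.asarray(genotypes) @ np.asarray(weights)

# genotypes: individuals x variants, coded as 0/1/2 minor-allele counts
G = np.array([[0, 1, 2],
              [2, 0, 1]])
w = np.array([0.2, -0.1, 0.05])  # hypothetical per-variant effect sizes
print(polygenic_score(G, w))
```

The methods discussed in the talk effectively replace the naive weights w with estimates that account for linkage disequilibrium, functional annotations, and cross-population information.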
Title: Three Rs — Reliability, Replicability, Reproducibility: the interplay between statistical science and data science
Abstract: The current pandemic has brought into sharp relief the essential role of data in nearly all aspects of science, government, and public health. But data is useless without explanation and interpretation, and statistical science has a long history and rich traditions of providing explanation and interpretation. However, statistical reasoning is often not well-understood, and misuse of statistical arguments has contributed to confusion over the three R’s in the title. In this talk, Reid describes how data science and statistical science together can provide a robust framework for extracting insights from data reliably, and thus contribute to both replicability and reproducibility. This is illustrated with a selection of examples from recent news articles, along with some discussion on the role of the theory of inference in this framework.
Title: Conditional Calibration for False Discovery Rate Control Under Dependence
Abstract: We introduce a new class of methods for finite-sample false discovery rate (FDR) control in multiple testing problems with dependent test statistics where the dependence is fully or partially known. Our approach separately calibrates a data-dependent p-value rejection threshold for each hypothesis, relaxing or tightening the threshold as appropriate to target exact FDR control. In addition to our general framework we propose a concrete algorithm, the dependence-adjusted Benjamini-Hochberg (dBH) procedure, which adaptively thresholds the q-value for each hypothesis. Under positive regression dependence the dBH procedure uniformly dominates the standard BH procedure, and in general it uniformly dominates the Benjamini–Yekutieli (BY) procedure (also known as BH with log correction). Simulations and real data examples illustrate power gains over competing approaches to FDR control under dependence. This is joint work with Lihua Lei.
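For reference, here is a minimal implementation of the standard BH step-up procedure that dBH refines; the per-hypothesis, dependence-adjusted thresholding of dBH itself is more involved and is not shown:

```python
import numpy as np

def bh_rejections(pvals, q=0.1):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest i with p_(i) <= q * i / m. dBH replaces
    this single data-independent threshold with a calibrated,
    per-hypothesis threshold that adapts to the known dependence."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= q * np.arange(1, m + 1) / m)[0]
    if passed.size == 0:
        return np.array([], dtype=int)
    k = passed.max() + 1
    return np.sort(order[:k])  # indices of rejected hypotheses

print(bh_rejections([0.001, 0.008, 0.039, 0.041, 0.6], q=0.1))
```

The BY procedure mentioned above is obtained by running the same step-up rule at the more stringent level q / (1 + 1/2 + ... + 1/m).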
Title: Nonparametric Estimation of Distributions and Diagnostic Accuracy Based on Group-Tested Results with Differential Misclassification
Abstract: This talk concerns the problem of estimating a continuous distribution in a diseased or nondiseased population when only group-based test results on the disease status are available. The problem is challenging in that individual disease statuses are not observed and testing results are often subject to misclassification, with the further complication that the misclassification may be differential as the group size and the number of diseased individuals in the group vary. We propose a nonparametric estimator of the distribution and establish its asymptotic properties.
The performance of the distribution estimator is evaluated under various design considerations concerning group sizes and classification errors. The method is exemplified with data from the National Health and Nutrition Examination Survey (NHANES) study to estimate the distribution and diagnostic accuracy of C-reactive protein in blood samples in predicting chlamydia incidence.
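A small illustrative computation, assuming a simplified non-differential misclassification model in which a pool is truly positive whenever at least one member is diseased; in the setting of this talk, sensitivity and specificity may additionally vary with group size and composition:

```python
def p_test_positive(group_size, prevalence, sensitivity, specificity):
    """Probability that a pooled sample tests positive when individual
    disease statuses are independent with the given prevalence and the
    assay misclassifies with fixed sensitivity/specificity."""
    p_truly_positive = 1 - (1 - prevalence) ** group_size
    return (sensitivity * p_truly_positive
            + (1 - specificity) * (1 - p_truly_positive))

# Pools of 5, 2% prevalence, 95% sensitivity, 98% specificity
print(round(p_test_positive(5, 0.02, 0.95, 0.98), 4))
```

Differential misclassification would replace the fixed sensitivity with a function of the number of diseased individuals in the pool, which is exactly what complicates the estimation problem.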
Title: Integrative Methods for Biobank-Scale Studies
Abstract: Recent breakthroughs in cost-effective genotyping have allowed the creation of ultra-large biobanks that link the genetic data of millions of patients with a multitude of phenotypic measurements (usually curated from electronic health records). The drastic increase in the number of individuals routinely analyzed in genomic studies has enabled novel statistical methods that employ fewer assumptions in estimating key parameters such as the heritability explained by genomic variants. I will present methods showcasing how SNP-heritability can be estimated accurately and efficiently, both at genome-wide scale and at particular regions of the genome.
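One classical route to the kind of estimate mentioned above is Haseman-Elston regression: regress products of standardized phenotypes on the corresponding entries of the genetic relatedness matrix. This synthetic sketch (continuous stand-ins for genotypes, a single simulation) is an illustration of that baseline, not the speaker's method:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h2 = 1000, 500, 0.5  # individuals, variants, true SNP-heritability

# Simulated standardized genotypes and a trait with SNP-heritability h2
Z = rng.normal(size=(n, m))
w = rng.normal(scale=np.sqrt(h2 / m), size=m)
y = Z @ w + rng.normal(scale=np.sqrt(1 - h2), size=n)

K = Z @ Z.T / m                       # genetic relatedness matrix (GRM)
iu = np.triu_indices(n, k=1)          # distinct pairs (i, j), i < j
k, p = K[iu], np.outer(y, y)[iu]      # E[y_i * y_j] = h2 * K_ij
h2_hat = (k @ p) / (k @ k)            # Haseman-Elston regression slope
print(round(h2_hat, 2))
```

The scalable methods in the talk achieve similar method-of-moments estimates without materializing the full n-by-n GRM.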
Title: Efficient Integration of EHR and Other Healthcare Datasets
Abstract: The growing availability and variety of healthcare data sources have provided unique opportunities for data integration and evidence synthesis, which can potentially accelerate knowledge discovery and enable better clinical decision making. However, many practical and technical challenges, such as data privacy and the high dimensionality and heterogeneity of different datasets, remain to be addressed. In this talk, I will introduce several methods for effective and efficient integration of electronic health records (EHR) and other healthcare datasets. Specifically, we develop communication-efficient distributed algorithms for jointly analyzing multiple datasets without sharing patient-level data. Our algorithms do not require iterative communication across sites and are able to account for heterogeneity across different datasets. We provide theoretical guarantees for the performance of our algorithms, along with examples of implementing them in real-world clinical research networks.
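A toy illustration of one-shot, communication-efficient aggregation for a homogeneous linear model: each site shares only aggregate sufficient statistics, never patient-level records, and a single round of communication recovers the pooled fit exactly. The talk's algorithms additionally handle heterogeneity across sites, which this sketch does not:

```python
import numpy as np

def site_summaries(X, y):
    """Each site computes and shares only (X'X, X'y); the patient-level
    rows of X and y never leave the site."""
    return X.T @ X, X.T @ y

def pooled_ols(summaries):
    """One-shot aggregation: summing the sufficient statistics and
    solving once reproduces the pooled least-squares estimate."""
    XtX = sum(s[0] for s in summaries)
    Xty = sum(s[1] for s in summaries)
    return np.linalg.solve(XtX, Xty)

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])  # true coefficients shared by all sites
sites = []
for _ in range(3):            # three simulated clinical sites
    X = rng.normal(size=(50, 2))
    y = X @ beta + rng.normal(scale=0.1, size=50)
    sites.append(site_summaries(X, y))
print(np.round(pooled_ols(sites), 1))
```

For nonlinear models the one-shot exactness is lost, which is why surrogate-likelihood-style constructions are needed; those are beyond this sketch.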
Title: PPA: Principal Parcellation Analysis for Human Brain Connectomes of Multiple Human Traits
Abstract: Human brain parcellation plays a fundamental role in neuroimaging. Standard practice parcellates the brain into Regions Of Interest (ROIs) based roughly on anatomical function. However, many different schemes are available involving different numbers and locations of ROIs, and choosing which scheme to use in practice is challenging. We propose a novel tractography-based Principal Parcellation Analysis (PPA), which conducts the clustering analysis on the fibers' ending points to redefine parcellation and eventually predict human traits. Specifically, our PPA eliminates the need to choose ROIs manually, reduces subjectivity and leads to a substantially different representation of the connectome. We illustrate the proposed approach through applications to HCP data and show that PPA connectomes are able to improve power in predicting a variety of human traits, while dramatically improving parsimony, compared to anatomical parcellation based connectomes.
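The core clustering idea can be caricatured with a plain k-means on synthetic endpoint coordinates, with the learned clusters playing the role of data-driven parcels in place of manually chosen ROIs. PPA's actual pipeline (tractography, connectome construction, trait prediction) is far richer; everything here is illustrative:

```python
import numpy as np

def kmeans_endpoints(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means on streamline endpoint coordinates;
    cluster labels define a data-driven parcellation."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated synthetic bundles of endpoints in 3-D
pts = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 3)),
                 np.random.default_rng(2).normal(5, 0.1, (20, 3))])
labels, _ = kmeans_endpoints(pts, k=2)
print(len(set(labels[:20])), len(set(labels[20:])))
```

Choosing k here plays a role analogous to choosing the number of ROIs in an atlas, but the parcel locations are learned from the fibers rather than fixed in advance.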
Title: Scalable and Consistent Estimation of Random Graph Models With Dependent Edge Variables and Parameter Vectors of Increasing Dimension Using the Pseudolikelihood
Abstract: An important question in statistical network analysis is how to construct models of dependent network data without sacrificing computational scalability and statistical guarantees. In this talk, we demonstrate that scalable estimation of random graph models with dependent edges and parameter vectors of increasing dimension is possible, using maximum pseudolikelihood estimators. On the statistical side, we establish the first consistency results and convergence rates for maximum pseudolikelihood estimators in scenarios where a single observation of dependent random variables is available and the number of parameters increases without bound; the main results make weak assumptions and may be of independent interest. In particular, they yield consistency and convergence rates for maximum pseudolikelihood estimators of random graph models with dependent edges and growing parameter vectors, under weak dependence and smoothness conditions. We showcase these results using generalized β-models with dependent edges and parameter vectors of increasing dimension, in dense- and sparse-graph settings. The primary results assume that the random graph is completely observed. The talk concludes with a discussion of how these theoretical developments offer avenues to advance the challenging topic of subgraph-to-graph estimation and inference, which considers estimating a random graph model based only on an observed subgraph, along with other potential extensions.
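To make the estimation target concrete, here is a sketch of a log-pseudolikelihood for a simple β-model-style graph with logit edge probabilities β_i + β_j: each edge contributes its conditional log-probability given the rest of the graph. For independent-edge models this coincides with the likelihood; the dependent-edge models in the talk are more general, and this sketch is only an assumption-laden illustration:

```python
import numpy as np

def log_pseudolikelihood(A, beta):
    """Log-pseudolikelihood of an undirected graph A under a
    beta-model-style parameterization: sum over dyads (i, j) of the
    conditional log-probability of edge A[i, j]."""
    n = len(beta)
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = 1.0 / (1.0 + np.exp(-(beta[i] + beta[j])))
            ll += A[i, j] * np.log(p) + (1 - A[i, j]) * np.log(1 - p)
    return ll

# A 3-node graph with edges 1-2 and 1-3, evaluated at beta = 0
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
print(round(log_pseudolikelihood(A, np.zeros(3)), 4))
```

Maximizing this objective over β is the pseudolikelihood analogue of maximum likelihood; the theoretical results in the talk quantify how well this works as the dimension of β grows with the graph.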
Title: Brain Connectivity Alteration Detection via Matrix-variate Differential Network Model
Abstract: Brain functional connectivity reveals the synchronization of brain systems through correlations in neurophysiological measures of brain activities. Growing evidence now suggests that the brain connectivity network experiences alterations with the presence of numerous neurological disorders, thus differential brain network analysis may provide new insights into disease pathologies. The data from neurophysiological measurement are often multi-dimensional and in a matrix form, posing a challenge in brain connectivity analysis. Existing graphical model estimation methods either assume a vector normal distribution that in essence requires the columns of the matrix data to be independent, or fail to address the estimation of differential networks across different populations. To tackle these issues, we propose an innovative Matrix-Variate Differential Network (MVDN) model. We exploit the D-trace loss function and a Lasso-type penalty to directly estimate the spatial differential partial correlation matrix, and use an ADMM algorithm for the optimization problem. Theoretical and simulation studies demonstrate that MVDN significantly outperforms other state-of-the-art methods in dynamic differential network analysis. We illustrate with a functional connectivity analysis of an Attention Deficit Hyperactivity Disorder (ADHD) dataset. The hub nodes and differential interaction patterns identified are consistent with existing experimental studies.
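One ingredient of such an ADMM scheme can be sketched in isolation: elementwise soft-thresholding, the proximal operator of the Lasso-type penalty, which sparsifies the estimated differential matrix at each iteration. This is an illustrative fragment, not the authors' full MVDN algorithm:

```python
import numpy as np

def soft_threshold(M, lam):
    """Elementwise soft-thresholding: the proximal operator of the
    l1 penalty, shrinking each entry toward zero by lam and zeroing
    small entries; used as the sparsifying step inside ADMM."""
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

# Hypothetical raw differential estimate; small entries are zeroed out
D = np.array([[0.90, -0.05],
              [0.30,  0.02]])
print(soft_threshold(D, 0.1))
```

In the full algorithm, this step alternates with a quadratic update driven by the D-trace loss and a dual-variable update until convergence.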