| December 2, 2011 | |
| Dr. Ji Zhu | |
| Joint Estimation of Multiple Graphical Models | |
| December 2, 2011 10:05 am | |
| OSB 110 | |
| Gaussian graphical models explore dependence relationships between random variables, through estimation of the corresponding inverse covariance matrices. In this paper we develop an estimator for such models appropriate for data from several graphical models that share the same variables and some of the dependence structure. In this setting, estimating a single graphical model would mask the underlying heterogeneity, while estimating separate models for each category does not take advantage of the common structure. We propose a method which jointly estimates the graphical models corresponding to the different categories present in the data, aiming to preserve the common structure, while allowing for differences between the categories. This is achieved through a hierarchical penalty that targets the removal of common zeros in the inverse covariance matrices across categories. We establish the asymptotic consistency and sparsity of the proposed estimator in the high-dimensional case, and illustrate its superior performance on a number of simulated networks. An application to learning semantic connections between terms from webpages collected from computer science departments is also included. This is joint work with Jian Guo, Elizaveta Levina, and George Michailidis. | |
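As a rough illustration of the kind of objective such joint methods work with (not the authors' estimator or algorithm), the sketch below evaluates a penalized Gaussian log-likelihood over K categories with a group-type penalty on each off-diagonal entry across categories, which is what encourages shared zeros; the function names and the exact penalty form are illustrative assumptions.

```python
import numpy as np

def joint_penalized_objective(S_list, Omega_list, lam):
    """Penalized negative log-likelihood for K related Gaussian graphical models.

    S_list     : list of K sample covariance matrices (p x p)
    Omega_list : list of K candidate inverse covariance matrices (p x p)
    lam        : penalty weight

    The penalty sum_{i<j} sqrt(sum_k |Omega_k[i, j]|) shrinks each (i, j) entry
    as a group across categories, encouraging common zeros while still allowing
    category-specific differences.
    """
    loss = 0.0
    for S, Omega in zip(S_list, Omega_list):
        _, logdet = np.linalg.slogdet(Omega)
        loss += np.trace(S @ Omega) - logdet
    grouped = np.sum([np.abs(Om) for Om in Omega_list], axis=0)
    iu = np.triu_indices(grouped.shape[0], k=1)
    return loss + lam * np.sqrt(grouped[iu]).sum()

# Toy check: two categories with identity covariance and identity estimates
p = 5
print(joint_penalized_objective([np.eye(p), np.eye(p)], [np.eye(p), np.eye(p)], lam=0.1))
```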
| Back To Top | |
| November 18, 2011 | |
| Dr. Kshitij Khare | |
| Cholesky-based estimation in graphical models | |
| November 18, 2011 10:05 am | |
| OSB 110 | |
| We consider the problem of sparse covariance estimation in high dimensional settings using graphical models. These models can be represented in terms of a graph, where the nodes represent random variables and edges represent their interactions. When the random variables are jointly Gaussian distributed, the lack of edges in such graphs can be interpreted as conditional and/or marginal independencies between these variables. We present a computationally efficient approach for high dimensional sparse covariance estimation in graphical models based on the Cholesky decomposition of the covariance matrix or its inverse. The proposed method is illustrated on both simulated and real data. | |
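One common Cholesky-based recipe (a sketch under the assumption of a fixed variable ordering, not necessarily the speaker's method) regresses each variable on its predecessors with a lasso penalty and assembles the modified Cholesky factors into a sparse precision estimate:

```python
import numpy as np
from sklearn.linear_model import Lasso

def cholesky_sparse_precision(X, alpha=0.1):
    """Sparse precision estimate via the modified Cholesky decomposition.

    Regress each variable (in a fixed ordering) on its predecessors with a
    lasso penalty; the fitted coefficients fill a unit lower-triangular T and
    the residual variances a diagonal D, giving Omega = T' D^{-1} T.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    T = np.eye(p)
    d = np.empty(p)
    d[0] = Xc[:, 0].var()
    for j in range(1, p):
        fit = Lasso(alpha=alpha, fit_intercept=False).fit(Xc[:, :j], Xc[:, j])
        T[j, :j] = -fit.coef_
        resid = Xc[:, j] - Xc[:, :j] @ fit.coef_
        d[j] = resid.var()
    return T.T @ np.diag(1.0 / d) @ T

# Example: X = np.random.randn(200, 10); Omega_hat = cholesky_sparse_precision(X)
```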
| Back To Top | |
| November 8, 2011 | |
| Dr. Bertrand Clark, Professor of Statistics, Dept. of Medicine and Dept. of Epidemiology and Public Health, Center for Computational Science, Miller School of Medicine, University of Miami | |
| Clustering Stability: Impossibility and Possibility | |
| November 8, 2011 11:00 am | |
| 499 DSL (Dirac Science Library) | |
| In the first part of this talk we present a theorem that gives conditions under which high dimensional clustering is unstable. Specifically, for any fixed sample size, clustering becomes impossible (in a squared error sense) as the dimension increases unless the separation among the clusters is large enough in the sense that coordinatewise differences do not decrease too quickly with $D$, the dimension of the data points. We also show that clustering impossibility occurs with a theoretical rate of ${\cal{O}}(\sqrt{D})$. In the second part of this talk we present a Bayesian method for assessing clustering stability. Roughly, the idea is to evaluate the probability that the distances between points and cluster centers can be re-ordered by random factors. The method seems to be consistent for choosing the number of clusters and we argue that it accurately reflects what we mean by the stability of a clustering. This work is ongoing research and hence comments and discussion are particularly welcome. | |
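A small simulation conveys the flavor of the instability phenomenon: when the coordinatewise separation between two Gaussian clusters shrinks like 1/sqrt(D), the relative gap between between-cluster and within-cluster distances vanishes as the dimension grows. This is only an illustrative toy, not the construction used in the theorem.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

rng = np.random.default_rng(0)

def relative_gap(D, delta, n=50):
    # two spherical Gaussian clusters whose means differ by `delta` in every coordinate
    a = rng.normal(size=(n, D))
    b = rng.normal(size=(n, D)) + delta
    within = pdist(a).mean()
    between = cdist(a, b).mean()
    return (between - within) / within

for D in (10, 100, 1000, 10000):
    # coordinatewise gap shrinking like 1/sqrt(D): the clusters blur together
    print(D, round(relative_gap(D, delta=5.0 / np.sqrt(D)), 4))
```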
| Back To Top | |
| November 4, 2011 | |
| Dr. Giray Okten | |
| Putting randomness back in quasi-Monte Carlo | |
| November 4, 2011 10:05 am | |
| OSB 110 | |
| The quasi-Monte Carlo method is often described as the deterministic version of the Monte Carlo method. It was developed over the last few decades, and its main advantages over Monte Carlo are faster convergence, at least asymptotically, and deterministic error bounds. In quasi-Monte Carlo one uses so-called low-discrepancy sequences to sample from a function, somewhat similar to the way pseudorandom numbers are used in Monte Carlo. Some mathematicians have even suggested avoiding the use of pseudorandom numbers altogether in favor of low-discrepancy sequences, since pseudorandom numbers do not have a "rigorous" definition. Despite these advantages, the quasi-Monte Carlo method also has some drawbacks. In this talk I will give a survey of hybrid methods: methods that bring randomness back into quasi-Monte Carlo, in order to bring together the best features of Monte Carlo and quasi-Monte Carlo. | |
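A minimal example of one such hybrid, randomized quasi-Monte Carlo with scrambled Sobol' points (using `scipy.stats.qmc`, available in SciPy 1.7+), shows how randomization restores replication-based error estimates while keeping the low-discrepancy structure; the smooth integrand below is an arbitrary test function with known integral 1.

```python
import numpy as np
from scipy.stats import qmc  # SciPy >= 1.7

def f(x):                     # smooth test integrand on [0, 1]^d, true integral = 1
    return np.prod(1.0 + 0.5 * (x - 0.5), axis=1)

d, n, reps = 5, 2**12, 20
rng = np.random.default_rng(1)

# Plain Monte Carlo replications
mc = [f(rng.random((n, d))).mean() for _ in range(reps)]

# Randomized quasi-Monte Carlo: scrambled Sobol' points are still low-discrepancy,
# but independent scramblings give independent replications and hence error estimates.
rqmc = [f(qmc.Sobol(d, scramble=True, seed=s).random(n)).mean() for s in range(reps)]

print("MC   std error:", np.std(mc))
print("RQMC std error:", np.std(rqmc))
```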
| Back To Top | |
| October 28, 2011 | |
| Dr. Howard Bondell | |
| Efficient Robust Estimation via Two-Stage Generalized Empirical Likelihood | |
| October 28, 2011 10:05 am | |
| OSB 110 | |
| The triumvirate of outlier resistance, distributional robustness, and efficiency in both small and large samples constitutes the Holy Grail of robust statistics. We show that a two-stage procedure based on an initial robust estimate of scale followed by an application of generalized empirical likelihood comes very close to attaining that goal. The resulting estimators attain full asymptotic efficiency at the Normal distribution, while simulations point to the ability to maintain this efficiency down to small sample sizes. Additionally, the estimators are shown to have the maximum attainable finite-sample replacement breakdown point, and thus remain stable in the presence of heavy-tailed distributions and outliers. Although previous proposals with full asymptotic efficiency exist in the literature, their finite-sample efficiency can often be low. The method is discussed in detail for linear regression, but can be naturally extended to other areas, such as multivariate estimation of location and covariance. | |
| Back To Top | |
| October 21, 2011 | |
| Dr. Jinfeng Zhang | |
| Integrated Bio-entity Network: A System for Biological Knowledge Discovery | |
| October 21, 2011 10:05 am | |
| OSB 110 | |
| A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of such information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction from unstructured text in the scientific literature. The relationship information is organized in a graph data structure, named the integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Uncertainties associated with each edge in the IBN are quantified by probabilities inferred using statistical machine learning methods. Under this framework, probability-based graph-theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses: the indirect relationships with high probabilities in the network. We show that the IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs. | |
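A sketch of the most probable path idea on a toy network (all entity names and probabilities below are made up): maximizing the product of edge probabilities is equivalent to running Dijkstra's algorithm on -log edge weights.

```python
import math
import networkx as nx

# Toy "integrated bio-entity network": vertices are bio-entities, each edge carries
# a probability that the stated relationship is real (hypothetical values).
edges = [("geneA", "proteinB", 0.9), ("proteinB", "diseaseC", 0.6),
         ("geneA", "pathwayD", 0.5), ("pathwayD", "diseaseC", 0.95)]

G = nx.Graph()
for u, v, p in edges:
    # A most probable path maximizes the product of edge probabilities, which is
    # the same as minimizing the sum of -log(p); Dijkstra handles the latter.
    G.add_edge(u, v, prob=p, cost=-math.log(p))

path = nx.dijkstra_path(G, "geneA", "diseaseC", weight="cost")
prob = math.exp(-nx.dijkstra_path_length(G, "geneA", "diseaseC", weight="cost"))
print(path, round(prob, 3))
```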
| Back To Top | |
| October 14, 2011 | |
| Dr. Yiyuan She | |
| Predictive Learning Through Joint Variable Selection and Rank Reduction for High-dimensional Data | |
| October 14, 2011 10:05 am | |
| OSB 110 | |
| The talk discusses joint variable and rank selection for supervised dimension reduction in predictive learning. When the number of responses and/or the number of predictors exceeds the sample size, one has to consider shrinkage methods for estimation and prediction. We propose to apply sparsity and reduced-rank techniques jointly to attain simultaneous feature selection and feature extraction. A class of estimators is introduced based on novel penalties that impose both row and rank restrictions on the coefficient matrix. We prove that these estimators adapt to the unknown matrix sparsity and have faster rates of convergence than the LASSO and reduced-rank regression. A computational algorithm is developed and applied to real-world applications in machine learning, cognitive neuroscience and macroeconometric forecasting. | |
| Back To Top | |
| October 7, 2011 | |
| Dr. Jonathan H. Dennis | |
| The Regulatory Organization of the Human Genome | |
| October 7, 2011 10:05 am | |
| OSB 110 | |
| A hallmark of cancer is altered chromosome structure. Consequently, the development and progression of cancer are classified by taking into account the chromosomal changes that cells undergo as they become more aggressive. Although there have been numerous studies on chromosomal aberrations in cancer, molecular assessment of chromosomal structure has been understudied, and its role in malignant transformation remains poorly characterized. The human genome is organized into chromatin. The most fundamental subunit of chromatin is the nucleosome: ~150 base pairs of DNA wrapped around a “spool” of histones. We have identified chromatin-based patterns across different lung adenocarcinoma grades. To address the role of chromatin structure in the progression of cancer, we compared the chromatin structure of primary lung adenocarcinomas of grades one, two and three to their normal adjacent tissue, from several individuals, at multiple scales. We developed systematic, robust nucleosome-distribution and chromatin-accessibility microarray mapping platforms to analyze chromatin structure genome-wide across cancer grades between normal and tumor samples. We measured chromatin structure at three levels of resolution: nucleosome distribution, chromatin accessibility and three-dimensional molecular cytology. We show that grade one lung adenocarcinomas have greatly altered nucleosome distributions compared to the adjacent normal tissue, but nearly identical chromatin accessibility. Conversely, the grade three samples show extensive rearrangements in chromosomal accessibility, but only modest changes in nucleosome distribution when tumor and normal samples are compared. These data have allowed us to develop a model in which early grade lung adenocarcinomas are linked to changes in nucleosome distributions, while later grade cancers are linked to large-scale chromosomal changes. These results indicate that we should be able to use these chromatin structural changes to identify grade- and sub-type-specific cancer biomarkers. | |
| Back To Top | |
| September 30, 2011 | |
| Dr. Hui Zou | |
| Some Results on Large Bandable Covariance Matrix Estimation | |
| September 30, 2011 10:05 am | |
| OSB 110 | |
| The covariance matrix is fundamental to many multivariate analysis techniques. In the era of high-dimensional data, estimating large covariance matrices is practically important and theoretically interesting. The first part of my talk concerns a general minimax theorem on the optimal estimation of large bandable covariance matrices. This result is a generalization of the minimax theorem obtained in Cai et al. (2010). The general minimax theorem reveals some new and interesting phenomena. For example, for certain parameter spaces there is a tapering estimator that simultaneously attains the minimax optimal rates of convergence under both the Frobenius and spectral norms. For the same parameter spaces it is even possible to achieve adaptive minimax optimal estimation under the spectral norm with NP dimensions. In the second part of the talk, I will address the issue of selecting the right tapering parameter. We propose a SURE tuning method based on Stein's Unbiased Risk Estimation theory. An extensive empirical study shows that SURE tuning is often comparable to oracle tuning and outperforms cross-validation. | |
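For concreteness, a minimal tapering estimator with the linear weight scheme of Cai et al. (2010) might look like the following sketch; the choice of the tapering parameter k is exactly what the SURE criterion in the talk is meant to tune.

```python
import numpy as np

def tapering_estimator(X, k):
    """Tapering estimator for a bandable covariance matrix.

    Element (i, j) of the sample covariance keeps full weight when |i - j| <= k/2,
    is down-weighted linearly until |i - j| >= k, and is zeroed beyond that.
    """
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    dist = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    kh = k / 2.0
    W = np.clip(2.0 - dist / kh, 0.0, 1.0)   # 1 inside the band, linear taper, 0 outside
    return S * W

# Example: X = np.random.randn(100, 50); Sigma_hat = tapering_estimator(X, k=8)
```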
| Back To Top | |
| September 23, 2011 | |
| Dr. Robert Clickner | |
| Applications of Statistics to Environmental, Health and Housing Research | |
| September 23, 2011 10:10 am | |
| OSB 110 | |
| The United States government and other governments collect and analyze data to inform public policy and programs. Generally, studies to collect and analyze data consist of one or more of the following components: research design, statistical design, methods development, implementation or data collection, data analysis, and report writing. All of these need statistics to help ensure the validity of the findings and of the public policies and programs that may result. This is a review of the applications and use of statistical methods in environmental, health and housing studies, drawing on my personal experiences and those of my colleagues. Topics include the design and implementation of population-based housing and environmental studies; modeling of industrial effluents; analyses of the contributions of environmental contaminants to human body burden and health effects in the face of measurement errors, confounding variables and other data issues; and the presentation of the findings. | |
| Back To Top | |
| September 16, 2011 | |
| Dr. Adrian Barbu | |
| Hierarchical Object Parsing from Noisy Point Clouds | |
| September 16, 2011 10:10 am | |
| OSB 110 | |
| Object parsing and segmentation from point clouds are challenging tasks because the relevant data is available only as thin structures along object boundaries or other object features and is corrupted by large amounts of noise. One way to handle this kind of data is by employing shape models that can accurately follow the object boundaries. Popular models such as Active Shape and Active Appearance models lack the necessary flexibility for this task. While more flexible models such as Recursive Compositional Models have been proposed, this paper builds on the Active Shape models and makes three contributions. First, it presents a flexible, mid-entropy, hierarchical generative model of object shape and appearance in images. The input data is explained by an object parsing layer, which is a deformation of a hidden PCA shape model with Gaussian prior. Second, it presents a novel efficient inference algorithm that uses a set of informed data-driven proposals to initialize local searches for the hidden variables. Third, it applies the proposed model and algorithm to object parsing from point clouds such as edge detection images, obtaining state of the art parsing errors on two standard datasets without using any intensity information. | |
| Back To Top | |
| September 2, 2011 | |
| Dr. Victor Patrangenaru | |
| Object Data Analysis | |
| September 2, 2011 10:10 am | |
| 110 OSB | |
| Analysis of Object Data is the more traditional name for Data Analysis on Sample Spaces with a Manifold Stratification. It includes Multivariate Analysis, Directional Data Analysis, Projective Shape Analysis as well as classical Shape Analysis, Diffusion Tensor Imaging, Functional Data Analysis, and Analysis of Phylogenetic Trees Data; pretty much any non-categorical statistical problem can be formulated as an object data analysis problem. Much of the standard nonparametric methodology extends from the multivariate case in the generic situation when the Fréchet mean of a random object (r.o.) is at a regular point. In practice there are situations when an r.o. has a mean located on the singular part of the stratified sample space, and the manifold-CLT-based techniques break down. Our initial goal is to understand the asymptotic behavior of estimators of the Fréchet mean of an arbitrary random object, and to develop nonparametric methodologies and fast inference techniques in applications. | |
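In the regular-point case, the intrinsic (Fréchet) mean is typically computed by an exp/log-map gradient descent. The sketch below does this on the ordinary sphere S^2 as a simple stand-in for a general manifold; it is illustrative only, not the methodology of the talk.

```python
import numpy as np

def frechet_mean_sphere(points, iters=100, step=0.5):
    """Intrinsic (Frechet) mean of unit vectors on S^2 by gradient descent.

    At each step, map the data to the tangent space at the current estimate
    (log map), average, and map back (exp map); this is the standard recipe
    when the mean sits at a regular point of the manifold.
    """
    mu = points.mean(axis=0)
    mu /= np.linalg.norm(mu)
    for _ in range(iters):
        cos_t = np.clip(points @ mu, -1.0, 1.0)         # log map of each point at mu
        theta = np.arccos(cos_t)
        perp = points - cos_t[:, None] * mu
        norms = np.linalg.norm(perp, axis=1)
        scale = np.where(norms > 1e-12, theta / np.maximum(norms, 1e-12), 0.0)
        v = step * (scale[:, None] * perp).mean(axis=0)  # average tangent vector
        nv = np.linalg.norm(v)
        if nv < 1e-10:
            break
        mu = np.cos(nv) * mu + np.sin(nv) * v / nv       # exp map back to the sphere
    return mu

# Example: noisy directions around the north pole
rng = np.random.default_rng(0)
pts = rng.normal([0, 0, 1], 0.1, size=(100, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print(frechet_mean_sphere(pts))
```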
| Back To Top | |
| August 25, 2011 | |
| Rommel Bain | |
| Monte Carlo Likelihood Estimation for Conditional Autoregressive Models with Application to Sparse Spatiotemporal Data | |
| August 25, 2011 2:00 pm | |
| 108 OSB | |
| Spatiotemporal modeling is increasingly used in a diverse array of fields, such as ecology, epidemiology, health care research, transportation, economics, and other areas where data arise from a spatiotemporal process. Often analyses of spatiotemporal models are fraught with problems such as missing data and computational complexity. In this paper, a Monte Carlo likelihood (MCL) method is introduced for the analysis of sparse spatiotemporal data (monthly mean zooplankton biomass) collected on a spatiotemporal lattice by the California Cooperative Oceanic Fisheries Investigations (CalCOFI) and assumed to follow a log-normal distribution. A conditional autoregressive (CAR) model is used to allow for spatiotemporal dependencies between nearest-neighbor sites on the spatiotemporal lattice. Typically, CAR model likelihood inference is quite complicated because of the intractability of the CAR model’s normalizing constant. Monte Carlo likelihood estimation provides an approximation for intractable likelihood functions. We illustrate MCL parameter estimation by computing the log normalized monthly mean (small) zooplankton displacement volume in ml/1000m3 to describe zooplankton seasonal variations for the CalCOFI time series. | |
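The Monte Carlo likelihood idea can be written down generically (a Geyer-Thompson-style sketch, not the dissertation's CAR-specific code): the intractable ratio of normalizing constants is replaced by a sample average under a fixed reference parameter. The callables `h` and `sample_psi` below are assumed to be supplied by the user.

```python
import numpy as np

def mc_loglik(theta, psi, x_obs, h, sample_psi, m=5000, seed=0):
    """Monte Carlo approximation of l(theta) - l(psi).

    h(theta, x)       : log of the *unnormalized* density at x (assumed available)
    sample_psi(m, rng): draws m samples from the model at the reference value psi
                        (for a CAR model this would typically be an MCMC sampler)
    The intractable ratio of normalizing constants c(theta)/c(psi) is replaced by
    a sample average of exp(h(theta, X) - h(psi, X)) with X drawn under psi.
    """
    rng = np.random.default_rng(seed)
    X = sample_psi(m, rng)
    log_ratios = np.array([h(theta, x) - h(psi, x) for x in X])
    log_c_ratio = np.logaddexp.reduce(log_ratios) - np.log(m)   # stable log-mean-exp
    return h(theta, x_obs) - h(psi, x_obs) - log_c_ratio

# Toy check with the unnormalized Gaussian density exp(-theta * x**2 / 2):
h = lambda theta, x: -theta * x**2 / 2
sample_psi = lambda m, rng: rng.normal(scale=1.0, size=m)   # model at psi = 1
print(mc_loglik(theta=2.0, psi=1.0, x_obs=0.3, h=h, sample_psi=sample_psi))
```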
| Back To Top | |
| August 23, 2011 | |
| Jihyung Shin | |
| Mixed-effects and mixed-distribution models for count data with applications to educational research data | |
| August 23, 2011 9:30 am | |
| 108 OSB | |
| This research is motivated by an analysis of reading and vocabulary data collected by the Florida Center for Reading Research. We are interested in modeling the reading ability of kindergarten children aged between 5 and 7. With consent from both parents and teachers, data were collected from 461 students on the number of letters pronounced correctly in a sixty-second period. The test was conducted three times over the academic year: Fall, Winter and Spring. The data showed an excess of zero scores on the test. In this dissertation, we examine zero-inflated Poisson (ZIP) regression models and mixed-effects and mixed-distribution models (MEMD), proposed by Lambert (1992) and Tooze (2002), respectively. The MEMD model is extended to Poisson count data in a longitudinal setting. Maximum likelihood estimates are obtained using standard statistical software. Results of the application are also presented. | |
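As a baseline without the random effects of the MEMD extension, a plain zero-inflated Poisson regression can be fit with statsmodels; the simulated covariate and coefficients below are arbitrary stand-ins, not the study's data.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(0)
n = 461
x = rng.normal(size=n)                         # stand-in covariate (e.g., age)
p_zero = 1 / (1 + np.exp(-(-1.0 + 0.8 * x)))   # structural-zero probability
lam = np.exp(0.5 + 0.6 * x)                    # Poisson mean for the count part
y = np.where(rng.random(n) < p_zero, 0, rng.poisson(lam))

X = sm.add_constant(x)
model = ZeroInflatedPoisson(y, X, exog_infl=X, inflation="logit")
print(model.fit(maxiter=200, disp=0).summary())
```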
| Back To Top | |
| August 15, 2011 | |
| Wei Liu | |
| A RIEMANNIAN FRAMEWORK FOR ANNOTATED CURVES ANALYSIS | |
| August 15, 2011 9:30 am | |
| 108 OSB | |
| We propose a Riemannian framework for shape analysis of annotated curves: curves that have certain attributes defined along them, in addition to their geometries. These attributes may be in the form of vector-valued functions, discrete landmarks, or symbolic labels, and provide auxiliary information along the curves. The resulting shape analysis, that is, comparing, matching, and deforming, is naturally influenced by the auxiliary functions. Our idea is to construct curves in higher dimensions using both geometric and auxiliary coordinates, and to analyze the shapes of these curves. The difficulty comes from the need to remove different transformation groups from different components: the shape is invariant to rigid motion, global scale and re-parameterization, while the auxiliary component is usually invariant only to re-parameterization. Thus, the removal of some transformations (rigid motion and global scale) is restricted to the geometric coordinates, while the re-parameterization group is removed from all coordinates. We demonstrate this framework using a number of experiments. | |
| Back To Top | |
| August 3, 2011 | |
| Sentibaleng Ncube | |
| A Novel Riemannian Metric for Analyzing Spherical Functions with Applications to HARDI Data | |
| August 3, 2011 9:30 am | |
| 108 OSB | |
| We propose a novel Riemannian framework for analyzing orientation distribution functions (ODFs), or their probability density functions (PDFs), in HARDI data sets for use in comparing, interpolating, averaging, and denoising PDFs. This is accomplished by separating shape and orientation features of PDFs, and then analyzing them separately under their own Riemannian metrics. We formulate the action of the rotation group on the space of PDFs, and define the shape space as the quotient space of PDFs modulo the rotations. In other words, any two PDFs are compared in: (1) shape by rotationally aligning one PDF to another, using the Fisher-Rao distance on the aligned PDFs, and (2) orientation by comparing their rotation matrices. This idea improves upon the results from using the Fisher-Rao metric in analyzing PDFs directly, a technique that is being used increasingly, and leads to geodesic interpolations that are biologically feasible. This framework leads to definitions and efficient computations for the Karcher mean that provide tools for improved interpolation and denoising. We demonstrate these ideas, using an experimental setup involving several PDFs. | |
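The shape-comparison step rests on the square-root representation of densities, under which the Fisher-Rao distance becomes an arc length on the unit sphere in L2. Below is a minimal discretized version on an interval grid, purely for illustration; it ignores the rotational alignment step described in the talk.

```python
import numpy as np

def fisher_rao_distance(p1, p2, weights):
    """Fisher-Rao distance between two discretized PDFs.

    Under the square-root map psi = sqrt(p), densities lie on the unit sphere in
    L2, and the Fisher-Rao distance is the arc length arccos(<psi1, psi2>).
    `weights` are quadrature weights on the sampling grid.
    """
    inner = np.sum(np.sqrt(p1) * np.sqrt(p2) * weights)
    return np.arccos(np.clip(inner, -1.0, 1.0))

# Toy example on [0, 1]: two beta densities on a grid of 200 points
t = np.linspace(0, 1, 200)
w = np.full_like(t, t[1] - t[0])
p1 = 6 * t * (1 - t)             # Beta(2, 2) density
p2 = 12 * t**2 * (1 - t)         # Beta(3, 2) density
print(fisher_rao_distance(p1, p2, w))
```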
| Back To Top | |
| July 29, 2011 | |
| Emilola Abayomi | |
| The Relationship of Body Weight to Blood Pressure in Diverse Populations | |
| July 29, 2011 10:00 am | |
| 108 OSB | |
| High blood pressure is a major determinant of risk for Coronary Heart Disease (CHD) and stroke, leading causes of death in the industrialized world. A myriad of pharmacological treatments for elevated blood pressure, defined as a blood pressure greater than 140/90 mmHg, are available and have at least partially resulted in large reductions in the incidence of CHD and stroke in the U.S. over the last 50 years. The factors that may increase blood pressure levels are not well understood, but body fat is thought to be a major determinant of blood pressure level. Obesity is measured through various methods (skinfolds, waist-to-hip ratio, bioelectrical impedance analysis (BIA), etc.), but the most commonly used measure is body mass index (BMI). Although the relationship between level of blood pressure and BMI has been extensively reported, several questions remain: Is there a significant relationship between blood pressure and body mass index in all populations? Is the relationship between body mass and blood pressure linear? How does the relationship vary in different populations? Do characteristics such as race and gender explain heterogeneity that may be present in the relationship across diverse populations? How does the relationship of other measures of body fat (skinfolds, waist-to-hip ratio, etc.) compare to the relationships found for BMI in diverse populations? To examine these questions we will conduct a meta-analysis based on person-level data from almost 30 observational studies from around the world. | |
| Back To Top | |
| July 28, 2011 | |
| Felicia Williams | |
| The Relationship of Diabetes and Coronary Heart Disease Mortality: a meta-analysis based on person-level data | |
| July 28, 2011 10:00 am | |
| 108 OSB | |
| Studies have suggested that diabetes is a stronger risk factor for coronary heart disease (CHD) in women than in men. We present a meta-analysis of person-level data from diverse populations to examine this issue. Our data come from 18 studies in which diabetes, CHD mortality and potential confounders were available and a minimum of 75 CHD deaths occurred. These studies followed up 69,308 men and 74,735 women aged 42 to 73 years on average from the US, Denmark, Iceland, Norway and the UK. Individual study prevalence rates of diabetes mellitus (mostly self-reported) at baseline ranged between less than 1% in the youngest cohort and 15.7% (males) and 11.1% (females) in the NHLBI Cardiovascular Health Study (CHS) of the elderly. CHD death rates varied between 2% and 20%. Hazard ratios (HR) associated with baseline diabetes, adjusted for age only, varied between 1.06 and 5.12 (males) and between 1.42 and 10.87 (females). Hazard ratios associated with baseline diabetes, adjusted for age, serum cholesterol level, systolic blood pressure, and cigarette smoking status varied between 1.10 and 6.69 (males) and between 1.25 and 9.01 (females). The male fixed-effect estimated HR of fatal CHD in diabetic versus non-diabetic participants after adjustment for the major risk factors was 2.40 (95% CI 2.11-2.73), whereas the corresponding HR for females was 2.89 (2.50-3.34). These estimates differed only slightly from unadjusted ones obtained from the same data [males: 2.43 (2.14, 2.75), p-value>0.25, and females: 2.91 (2.52, 3.35), p-value>0.25]. They agree closely with estimates (odds ratios of 2.3 for males and 2.9 for females) obtained in a recent meta-analysis of 8 studies of both fatal and nonfatal CHD, but based on literature-based data. There are insufficient data to suggest that there is a difference between the models that adjust for additional major CHD risk factors and the models that are unadjusted. | |
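The analysis in the talk pools person-level data, but the familiar summary-level fixed-effect calculation conveys the pooling step: precision-weighted averaging of study log hazard ratios. The inputs in the example are hypothetical, not the study values.

```python
import numpy as np

def fixed_effect_pool(hrs, ci_lows, ci_highs):
    """Inverse-variance (fixed-effect) pooling of study-level hazard ratios.

    Each study contributes log(HR) with a standard error recovered from its 95%
    confidence interval; the pooled log(HR) is the precision-weighted mean.
    """
    log_hr = np.log(hrs)
    se = (np.log(ci_highs) - np.log(ci_lows)) / (2 * 1.96)
    w = 1.0 / se**2
    pooled = np.sum(w * log_hr) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    ci = np.exp(pooled + np.array([-1.96, 1.96]) * pooled_se)
    return np.exp(pooled), ci

# Hypothetical study-level hazard ratios and 95% CIs
print(fixed_effect_pool(np.array([2.1, 3.0, 2.6]),
                        np.array([1.5, 2.0, 1.8]),
                        np.array([2.9, 4.5, 3.8])))
```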
| Back To Top | |
| June 9, 2011 | |
| Lindsey Bell | |
| A STATISTICAL APPROACH FOR INFORMATION EXTRACTION OF BIOLOGICAL RELATIONSHIPS | |
| June 9, 2011 10:00 am | |
| HCB 207 | |
| Vast amounts of biomedical information are stored in scientific literature, easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but is time and resource consuming. As this content continues to rapidly grow, the popularity and importance of text mining for obtaining information from unstructured text becomes increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER); and then facts concerning these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature. Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use the structure of triplets where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small-molecules. Interaction words describe the relationship between the biological terms. Under this framework we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian Networks, Support Vector Machines, and mixture of logistic models defined by interaction word. The three classifiers and ensemble approach are evaluated on three benchmark corpora and one corpus that is introduced in this study. The evaluation includes cross validation and cross-corpus validation to replicate an application scenario. The three classifiers are unique and we find that performance of individual classifiers varies depending on the corpus. Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance. | |
| Back To Top | |
| June 7, 2011 | |
| Greg Miller | |
| INVESTIGATING THE USE OF MORTALITY DATA AS A SURROGATE FOR MORBIDITY DATA | |
| June 7, 2011 10:00 am | |
| Bel 001 | |
| We are interested in differences between risk models based on Coronary Heart Disease (CHD) incidence, or morbidity, and risk models based on CHD death. Risk models based on morbidity have been developed from the Framingham Heart Study, while the European SCORE project developed a risk model for CHD death. Our goal is to determine whether those two models differ in treatment decisions concerning patient heart health. We begin by reviewing recent metrics in surrogate variables and prognostic model performance. We then conduct bootstrap hypothesis tests between two Cox proportional hazards models using Framingham data, one with incidence as a response and one with death as a response, and find that the coefficients differ for the age covariate, with no significant difference for the other risk factors. To understand how surrogacy can be applied to our case, where the surrogate variable is nested within the true variable of interest, we examine models based on a composite event compared to models based on singleton events. We also conduct a simulation, simulating times to a CHD incidence and times from CHD incidence to CHD death, censoring at 25 years to emulate the end of a study. We compare a Cox model with death as the response to a Cox model based on incidence using bootstrapped confidence intervals, and find differences in the coefficients for age and systolic blood pressure. We continue the simulation by using the Net Reclassification Index (NRI) to evaluate the treatment-decision performance of the two models, and find that the two models do not perform significantly differently in correctly classifying events if the decisions are based on the risk ranks of the individuals. As long as the relative order of patients' risks is preserved across different risk models, treatment decisions based on classifying an upper specified percent as high risk will not be significantly different. We conclude the dissertation with statements about future methods of approaching our question. | |
| Back To Top | |
| June 2, 2011 | |
| Robert Holden | |
| Failure Time Regression Models for Thinned Point Processes | |
| June 2, 2011 10:00 am | |
| BEL 006 | |
| In survival analysis, data on the time until a specific criterion event occurs are analyzed, often with regard to the effects of various predictors. In the classic applications, the criterion event is in some sense a terminal event, e.g., death of a person or failure of a machine or machine component. In these situations, the analysis requires assumptions only about the distribution of waiting times until the criterion event occurs and the nature of the effects of the predictors on that distribution. Suppose that the criterion event isn't a terminal event that can only occur once, but is a repeatable event. The sequence of events forms a stochastic point process. Further suppose that only some of the events are detected (observed); the detected events form a thinned point process. Any failure time model based on the data will be based not on the time until the first occurrence, but on the time until the first detected occurrence of the event. I will consider the implications of this for survival regression models. Such models will have little meaning unless the regression parameters are independent of the detection probability, or, more generally, the thinning mechanism. I will show that the effect of thinning on regression parameters depends on the combination of the type of regression model and the type of point process that generates the events. For some combinations, the effect of a predictor will be the same for the time to the first event and the time to the first detected event. For other combinations, the regression effect will be changed as a result of the incomplete detection. | |
| Back To Top | |
| May 31, 2011 | |
| Jennifer Geis | |
| A WEIGHTED APPROACH TO RANK SELECTION WITH A DATA-ADAPTIVE METHOD TO CANONICAL CORRELATION ANALYSIS | |
| May 31, 2011 1:00 pm | |
| BEL 007 | |
| Back To Top | |
| May 26, 2011 | |
| Leif Ellingson | |
| STATISTICAL SHAPE ANALYSIS ON MANIFOLDS WITH APPLICATIONS TO PLANAR CONTOURS AND STRUCTURAL PROTEOMICS | |
| May 26, 2011 10:00 am | |
| 006 BEL | |
| The technological advances in recent years have produced a wealth of intricate digital imaging data that is analyzed effectively using the principles of shape analysis. Such data often lie on either high-dimensional or infinite-dimensional manifolds. With computing power also now strong enough to handle this data, it is necessary to develop theoretically sound methodology to perform the analysis in a computationally efficient manner. In this dissertation, we propose approaches for doing so for planar contours and the three-dimensional atomic structures of protein binding sites. First, we adapt Kendall's definition of direct similarity shapes of finite planar configurations to shapes of planar contours under certain regularity conditions and utilize Ziezold's nonparametric view of Fréchet mean shapes. The space of direct similarity shapes of regular planar contours is embedded in a space of Hilbert-Schmidt operators in order to obtain the Veronese-Whitney extrinsic mean shape. For computations, it is necessary to use discrete approximations of both the contours and the embedding. For cases when landmarks are not provided, we propose an automated, randomized landmark selection procedure that is useful for contour matching within a population and is consistent with the underlying asymptotic theory. For inference on the extrinsic mean direct similarity shape, we consider a one-sample neighborhood hypothesis test and the use of the nonparametric bootstrap to approximate confidence regions. Bandulasiri et al. (2008) suggested using extrinsic reflection size-and-shape analysis to study the relationship between the structure and function of protein binding sites. In order to obtain meaningful results for this approach, it is necessary to identify the atoms common to a group of binding sites with similar functions and obtain proper correspondences for these atoms. We explore this problem in depth and propose an algorithm for simultaneously finding the common atoms and their respective correspondences based upon the Iterative Closest Point algorithm. For a benchmark data set, our classification results compare favorably with those of leading established methods. Finally, we discuss current directions in the field of statistics on manifolds, including a computational comparison of intrinsic and extrinsic analysis for various applications and a brief introduction to sample spaces with manifold stratification. | |
| Back To Top | |
| May 19, 2011 | |
| Jianchang Lin | |
| Semiparametric Bayesian survival analysis via transform-both-sides model | |
| May 19, 2011 1:00 pm | |
| BEL 007 | |
| We propose a new semiparametric survival model with a log-linear median regression function as a useful alternative to the popular Cox (1972) model and to linear transformation models (Cheng et al., 1995). Compared to existing semiparametric models, our models have many practical advantages, including the interpretation of regression parameters via the median, the ability to incorporate heteroscedasticity, and the ease of prior elicitation and computation of Bayesian estimators. Our Bayesian estimation method is also extended to a multivariate survival model with a symmetric random effects distribution. Our multivariate survival model has the same covariate effects on the marginal (population average) and conditional (given random effects) median survival times. Our other aim is to develop Bayesian simultaneous variable selection and estimation of median regression for a skewed response variable. Our hierarchical Bayesian model incorporates the advantages of the Lasso penalty, albeit for a skewed and heteroscedastic response variable. Preliminary simulation studies have been conducted to compare the performance of the proposed model with an existing frequentist median lasso. In terms of estimation bias and mean squared error, our proposed model performs as well as, and in some scenarios better than, competing frequentist estimators. We illustrate our approaches and model diagnostics via reanalysis of real-life clinical studies, including a small-cell lung cancer study and a retinopathy study. | |
| Back To Top | |
| May 19, 2011 | |
| Daniel Osborne | |
| Nonparametric Data Analysis on Manifolds with an Application in Medical Imaging | |
| May 19, 2011 10:00 am | |
| 006 BEL | |
| Over the past fifteen years, there has been rapid development in Nonparametric Statistical Analysis on Shape Manifolds applied to Medical Imaging. For surgery planning, a more appropriate approach is to also take size into account when analyzing the CT scan data. In this context, one performs a nonparametric analysis of the 3D data retrieved from CT scans of healthy young adults on the Size-and-Reflection Shape Space of k-ads in general position in 3D. This work, part of a larger project on planning reconstructive surgery in severe skull injuries, includes preprocessing and post-processing steps for CT images. The preprocessing step consists of extracting the boundary of the bone structure from the CT slices, while the post-processing steps consist of 3D reconstruction of the virtual skull from these bone extractions and smoothing. Next we present preliminary results for the Schoenberg sample mean Size-and-Reflection Shape of k-ads in general position in R^3 for the human skull based on these virtual reconstructions. The bootstrap distribution of the Schoenberg sample mean 3D Size-and-Reflection Shape for a selected group of anatomic landmarks and pseudo-landmarks is computed for 500 bootstrap resamples of the original 20 skulls, represented by 3 by k configurations with k=9. Finally, we report a confidence region for the Schoenberg mean configuration. | |
| Back To Top | |
| May 16, 2011 | |
| Tamika Royal-Thomas | |
| Interrelating of Longitudinal Processes: An Empirical Example | |
| May 16, 2011 9:30 am | |
| 207 HCB | |
| The Barker Hypothesis states that maternal and 'in utero' attributes during pregnancy affect a child's cardiovascular health throughout life. We present an analysis of a unique longitudinal dataset from Jamaica that consists of three longitudinal processes: (i) the maternal longitudinal process: blood pressure and anthropometric measurements taken at seven time points on the mother during pregnancy; (ii) in utero measurements: ultrasound measurements of the fetus taken at six time points during pregnancy; and (iii) the birth-to-present process: children's anthropometric and blood pressure measurements at 24 time points from birth to 14 years. A comprehensive analysis of the interrelationship of these three longitudinal processes is presented using joint modeling for multivariate longitudinal profiles. We propose a new methodology for examining a child's cardiovascular risk by extending a current view of likelihood estimation; joint modeling of the multivariate longitudinal profiles is carried out and the extension of the traditional likelihood method is compared to the maximum likelihood estimates. Our main goal is to examine whether the maternal process predicts fetal development, which in turn predicts the future cardiovascular health of the children. One of the difficulties with in utero and early childhood data is that certain variables are highly correlated, so dimension reduction techniques are quite applicable in this scenario. Principal component analysis (PCA) is utilized to create a smaller set of uncorrelated variables, which are then used in an optimal linear mixed model for longitudinal data; this analysis indicates that in utero and early childhood attributes predict the future cardiovascular health of the children. This thesis adds to the body of knowledge on the developmental origins of adult disease and supplies some significant results while utilizing a rich diversity of statistical methodologies. | |
| Back To Top | |
| May 2, 2011 | |
| Yinfeng Tao | |
| Title: "The frequentist properties and general performance of Bayesian Confidence Intervals for the survival function." | |
| May 2, 2011 11:00 am | |
| 108 OSB | |
| Estimation of a survival function is a very important topic in survival analysis, with contributions from many authors. We consider estimation of confidence intervals for the survival function based on right-censored or interval-censored survival data. In the right-censored case, almost all confidence intervals are based in some way on the Kaplan-Meier estimator, first proposed by Kaplan and Meier (1958) and widely used as the nonparametric estimator in the presence of right-censored data. For interval-censored data, the Turnbull estimator (Turnbull, 1974) plays a similar role. For a class of Bayesian models involving mixtures of Dirichlet priors, Doss and Huffer (2003) suggested several simulation techniques to approximate the posterior distribution of the survival function by using Markov chain Monte Carlo or sequential importance sampling. These techniques can lead to point estimates and probability intervals for the survival function at arbitrary time points for both the right-censored and interval-censored cases. The main objective of this thesis is to examine the frequentist properties and general performance of the Bayesian probability intervals when the prior is non-informative. Simulation studies will be used to compare these Bayesian probability intervals based on Doss and Huffer's approach with other published methods for obtaining pointwise confidence intervals for the survival function. Similar comparisons will be carried out for confidence intervals for quantiles of the survival function. We also describe an approach for constructing simultaneous confidence bands for the survival function, which will be investigated in future work. | |
| Back To Top | |
| April 29, 2011 | |
| Paul Hill | |
| Bootstrap Prediction Bands for Non-Parametric Function Signals in a Complex System | |
| April 29, 2011 10:00 am | |
| 307 HCB | |
| Methods employed in the construction of prediction bands for continuous curves require a different approach from those used for a data point. In many cases, the underlying function is unknown, and thus a distribution-free approach which preserves sufficient coverage for the signal in its entirety is necessary in the signal analysis. Three methods for the formation of (1-α)100% prediction bands are presented and their performances are compared through the coverage probabilities obtained. These techniques are applied to constructing prediction bands for spring discharge, giving good coverage in each case. Spring discharge measured over time can be considered as a continuous signal, and the ability to predict future signals of spring discharge is useful for monitoring flow and other issues related to the spring. The gamma distribution is commonly used in the simulation of rainfall. Bootstrapping the rainfall in the proposed manner allows for adequately creating new samples over different periods of time as well as specific rain events such as hurricanes or drought. This non-parametric approach to the input rainfall augurs well for the non-parametric nature of the output signal. | |
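One simple distribution-free construction of a simultaneous band (a sketch of the general idea, not necessarily one of the three methods in the talk) calibrates a sup-type statistic by bootstrapping whole curves:

```python
import numpy as np

def bootstrap_simultaneous_band(curves, alpha=0.05, B=2000, seed=0):
    """Bootstrap simultaneous prediction band for discretized signals.

    curves : (n, T) array of n signals on a common time grid.
    The band is mean(t) +/- c * sd(t), where c is calibrated so that, over
    bootstrap resamples, a "future" curve falls entirely inside the band with
    probability about 1 - alpha.
    """
    rng = np.random.default_rng(seed)
    n, _ = curves.shape
    center, spread = curves.mean(axis=0), curves.std(axis=0, ddof=1)
    sups = []
    for _ in range(B):
        resamp = curves[rng.integers(0, n, size=n)]
        future = resamp[rng.integers(0, n)]          # candidate future curve
        sups.append(np.max(np.abs(future - resamp.mean(axis=0))
                           / resamp.std(axis=0, ddof=1)))
    c = np.quantile(sups, 1 - alpha)
    return center - c * spread, center + c * spread

# Example with synthetic "discharge" signals on a common grid
t = np.linspace(0, 1, 100)
curves = np.sin(2 * np.pi * t) + 0.3 * np.random.default_rng(1).normal(size=(40, 100))
lo, hi = bootstrap_simultaneous_band(curves)
```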
| Back To Top | |
| March 25, 2011 | |
| Yu Gu | |
| New Semiparametric Methods for Recurrent Events Data | |
| March 25, 2011 2:00 pm | |
| HCB 210 | |
| Recurrent events data arise in all areas of biomedical research. We present a model for recurrent events data with the same link for the intensity and mean functions. Simple interpretations of the covariate effects on both the intensity and mean functions lead to a better understanding of the covariate effects on the recurrent events process. We use partial likelihood and empirical Bayes methods for inference and provide theoretical justifications as well as relationships between these methods. We also show the asymptotic properties of the empirical Bayes estimators. We illustrate the computational convenience and implementation of our methods with the analysis of a heart transplant study. We also propose an additive regression model and an associated empirical Bayes method for the risk of a new event given the history of the recurrent events. Both the cumulative mean and rate functions have closed-form expressions for our model. Our inference method for the semiparametric model is based on maximizing a finite-dimensional integrated likelihood obtained by integrating over the nonparametric cumulative baseline hazard function. Our method can accommodate time-varying covariates and is computationally easier to implement than full Bayes methods based on iterative algorithms. The asymptotic properties of our estimates give the large-sample justification from a frequentist standpoint. We apply our method to a study of heart transplant patients to illustrate the computational convenience and other advantages of our method. | |
| Back To Top | |
| March 24, 2011 | |
| Vernon Lawhern | |
| "Statistical Modeling and Applications of Neural Spike Trains" | |
| March 24, 2011 3:35 pm | |
| 110 OSB | |
| Understanding how spike trains encode information is a principal question in the study of neural activity. Recent advances in biotechnology have given researchers the ability to record neural activity on a wide scale, allowing them to perform detailed analyses that may have been impossible just a few years ago. Here we present several frameworks for the statistical modeling of neural spike trains. We first develop a Generalized Linear Model (GLM) framework that incorporates the effects of hidden states in the modeling of neural activity in the primate motor cortex. We then develop a state-space model that incorporates target information in the modeling framework. In both cases, significant improvements in model fitting and decoding accuracy were observed. Finally, in joint work with Dr. Contreras and Dr. Nikonov from the Psychology Department, we study taste coding and discrimination in the gustatory system by using information-theoretic tools such as mutual information, and by using a recently developed spike train metric to study the clustering performance from recordings of proximate neurons. | |
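A stripped-down version of the GLM ingredient, without the hidden-state or target-information extensions described in the talk, is a Poisson regression of binned spike counts on a stimulus covariate and spike-history terms; the simulated data, coefficients, and bin size below are arbitrary assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T = 2000                                   # number of time bins
stim = rng.normal(size=T)                  # stand-in stimulus covariate
rate = np.exp(-2.0 + 0.8 * stim)           # true firing rate per bin
spikes = rng.poisson(rate)

# Design matrix: stimulus plus spike-history terms (counts in the previous 3 bins),
# a common GLM formulation for spike trains.
hist = np.column_stack([np.roll(spikes, k) for k in (1, 2, 3)])
hist[:3] = 0                               # discard wrapped-around history at the start
X = sm.add_constant(np.column_stack([stim, hist]))
glm = sm.GLM(spikes, X, family=sm.families.Poisson()).fit()
print(glm.params)
```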
| Back To Top | |
| March 16, 2011 | |
| Anqi Tang | |
| A CLASS OF MIXED-DISTRIBUTION MODELS WITH APPLICATIONS IN FINANCIAL DATA ANALYSIS | |
| March 16, 2011 9:30 am | |
| 499 DSL | |
| Statisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zero-inflated longitudinal data, where the response variable has a large portion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a two-part mixed-distribution model for zero-inflated longitudinal data. The first part of the model is a logistic regression model for the probability of a nonzero response; the other part is a linear model for the mean response given that the outcome is not zero. Random effects with an AR(1) covariance structure are introduced into both parts of the model to allow for serial correlation and subject-specific effects. Estimating the two-part model is challenging because of the high-dimensional integration necessary to obtain the maximum likelihood estimates. We propose a Monte Carlo EM algorithm for obtaining the maximum likelihood estimates of the parameters. Through a simulation study, we demonstrate the good performance of the MCEM method in parameter and standard error estimation. To illustrate, we apply the two-part model with correlated random effects and the model with autoregressive random effects to executive compensation data to investigate potential determinants of CEO stock option grants. | |
| Back To Top | |
| February 25, 2011 | |
| Wei Wu | |
| February 25, 2011 10:10 am | |
| Back To Top | |
| February 18, 2011 | |
| Adrian Barbu | |
| Automatic Detection and Segmentation of Lymph Nodes from CT Data | |
| February 18, 2011 10:10 am | |
| 108 OSB | |
| Lymph nodes are assessed routinely in clinical practice and their size is followed throughout radiation or chemotherapy to monitor the effectiveness of cancer treatment. This work presents a robust learning-based method for automatic detection of solid lymph nodes from CT data, with the following contributions. First, it presents a learning based approach to solid lymph node detection that relies on Marginal Space Learning to achieve great speedup with virtually no loss in accuracy. Second, it presents an efficient segmentation method for solid lymph nodes. Third, it introduces two new sets of features that are effective for LN detection, one that self-aligns to high gradients and another set obtained from the segmentation result. The method is evaluated on large datasets obtaining better than state of the art results, with a running time of 5-40 seconds per volume. An added benefit of the method is the capability to detect and segment conglomerated lymph nodes. | |
| Back To Top | |
| February 11, 2011 | |
| Feng Zhao | |
| Bayesian portfolio optimization with time-varying factor models | |
| February 11, 2011 10:10 am | |
| 108 OSB | |
| We develop a modeling framework to simultaneously evaluate various types of predictability in stock returns, including stocks' sensitivity ("betas") to systematic risk factors, stocks' abnormal returns unexplained by risk factors ("alphas"), and returns of risk factors in excess of the risk-free rate ("risk premia"). Both | |
| Back To Top | |
| January 28, 2011 | |
| Anuj Srivastava | |
| Statistical Modeling of Elastic Functions | |
| January 28, 2011 10:10 am | |
| 108 OSB | |
| Motivated by the well-known problem of finding spurts in Berkeley growth data, we are interested in modeling functions that allow some warping in the time domain. Such warping is useful in improving the matching of peaks and valleys across functions and results in models that better preserve structures in the original data. The challenge is to develop a principled approach that can automatically warp a given set of functions to result in an optimal alignment. The aligned functions are said to represent the "y-variability" and the warping functions used to align them form the "x-variability" of the data. I will start by summarizing the main ideas used in the past literature (e.g. Kneip and Gasser, Annals, 1992; Ramsay and Li, JRSSB, 1998; Kneip and Ramsay, JASA 2008; Liu and Mueller, JASA, 2004) and what I view as their limitations. Then, I will describe our approach for (1) aligning, comparing and modeling functions and (2) modeling the warping functions. This framework is based on the use of the Fisher-Rao Riemannian metric, which provides a proper distance for comparing time-warped functional data. These distances are then used to define Karcher means, and the individual functions are optimally warped to align them to the Karcher means to extract the y-variability. Principal component analysis and stochastic modeling of these constituents x and y in their respective spaces lead to the desired modeling of functional variation. These ideas are demonstrated using both simulated and real data from different application domains: the Berkeley growth study, handwritten signature curves, and neuroscience spike trains. (Collaborators: Wei Wu, Sebastian Kurtek, Eric Klassen, and J. Steve Marron (UNC, Chapel Hill)) | |
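A key computational device in this line of work is the square-root slope function (SRSF), under which the Fisher-Rao metric reduces to the ordinary L2 metric. Below is a minimal numerical sketch (the test functions are arbitrary); the full framework in the talk additionally optimizes over warpings to align the functions.

```python
import numpy as np

def srsf(f, t):
    """Square-root slope function q = sign(f') * sqrt(|f'|) of a sampled function."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

def srsf_distance(f1, f2, t):
    # Under the SRSF representation the Fisher-Rao metric becomes the L2 metric,
    # so this distance is unchanged if both functions are composed with the same
    # time warping.
    d = srsf(f1, t) - srsf(f2, t)
    return np.sqrt(np.sum(d**2) * (t[1] - t[0]))

t = np.linspace(0, 1, 500)
f1 = np.exp(-(t - 0.4)**2 / 0.01)     # a single peak at t = 0.4
f2 = np.exp(-(t - 0.6)**2 / 0.01)     # the same peak shifted to t = 0.6
print(srsf_distance(f1, f2, t))
```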
| Back To Top | |
| January 21, 2011 | |
| Jim Berger of Duke University | |
| "I don't know where I'm gonna go when the volcano blows" | |
| January 21, 2011 10:10 am | |
| 108 OSB | |
| wrote Jimmy Buffett. Great song line, but usually it's too late to go when the volcano blows; one has to know when to go before the volcano blows. The problem of risk assessment for rare natural hazards -- such as volcanic pyroclastic flows -- is addressed, and illustrated with the Soufriere Hills Volcano on the island of Montserrat. Assessment is approached through a combination of mathematical computer modeling, statistical modeling of geophysical data, and extreme-event probability computation. A mathematical computer model of the natural hazard is used to provide the needed extrapolation to unseen parts of the hazard space. Statistical modeling of the available geophysical data is needed to determine the initializing distribution for exercising the computer model. In dealing with rare events, direct simulations involving the computer model are prohibitively expensive, so computation of the risk probabilities requires a combination of adaptive design of computer model approximations (emulators) and rare event simulation. | |
| Back To Top | |