|
|
|
Colloquia
|
| December 13, 2012, 10:00 am |
Rommel Bain, Department of Statistics, Florida State University, Dissertation Defense |
| December 3, 2012, 1:00 pm |
Yuanyuan Tang, Department of Statistics, Florida State University, Essay Defense |
| November 30, 2012, 2:00 pm |
Seungyeon Ha, Department of Statistics, Florida State University, Essay Defense |
| November 30, 2012, 10:00 am |
Yiyuan She, Department of Statistics, Florida State University |
| November 16, 2012, 10:00 am |
Jiashun Jin, Department of Statistics, Carnegie Mellon University |
| November 9, 2012, 10:00 am |
Ming Yuan, School of Industrial & Systems Engineering, Georgia Tech |
| November 7, 2012, 3:35 pm |
David Bristol, Statistical Consulting Services, Inc. |
| November 2, 2012, 10:00 am |
Jinfeng Zhang, Department of Statistics, FSU |
| October 30, 2012, 10:00 am |
Steve Chung, Ph.D. Candidate |
| October 29, 2012, 12:00 pm |
Emilola Abayomi, Ph.D. Candidate, Dissertation Defense |
| October 26, 2012, 10:00 am |
Ciprian Crainiceanu, Department of Biostatistics, Johns Hopkins University |
| October 19, 2012, 10:00 am |
Gareth James, Marshall School of Business, University of Southern California |
| October 12, 2012, 10:00 am |
Michelle Arbeitman, College of Medicine, FSU |
| October 5, 2012, 10:00 am |
Adrian Barbu, Dept. of Statistics, FSU |
| September 28, 2012, 10:00 am |
Vladimir Koltchinskii, Dept. of Mathematics, Georgia Tech |
| September 21, 2012, 10:00 am |
Xiaotong Shen, John Black Johnston Distinguished Professor, School of Statistics, University of Minnesota |
| September 14, 2012, 10:00 am |
Xiuwen Liu, FSU Dept. of Computer Science |
| August 9, 2012, 11:00 am |
Senthil Girimurugan |
| May 4, 2012, 10:00 am |
Jingyong Su, FSU Dept. of Statistics |
| April 27, 2012, 3:30 pm |
Ester Kim, FSU Dept. of Statistics |
| April 27, 2012, 10:00 am |
Sebastian Kurtek, Ph.D. Candidate, Dissertation Defense |
| April 20, 2012, 10:00 am |
Sunil Rao, University of Miami |
| April 13, 2012, 10:00 am |
Gretchen Rivera, FSU Dept. of Statistics |
| April 6, 2012, 10:00 am |
Xu Han, University of Florida |
| March 30, 2012, 2:00 pm |
Jordan Cuevas, Ph.D. Candidate, Dissertation Defense |
| March 30, 2012, 10:00 am |
Jinfeng Zhang, FSU Dept. of Statistics |
| March 29, 2012, 2:00 pm |
Paul Hill |
| March 28, 2012, 9:00 am |
Rachel Becvarik, FSU Dept. of Statistics |
| March 27, 2012, 3:30 pm |
Jihyung Shin, FSU Dept. of Statistics |
| March 26, 2012, 1:00 pm |
Jianchang Lin |
| March 23, 2012, 10:00 am |
Bob Clickner, FSU Dept. of Statistics |
| March 16, 2012, 10:00 am |
Wei Wu, FSU Dept. of Statistics |
| March 2, 2012, 10:00 am |
Piyush Kumar, FSU Dept. of Computer Science |
| March 1, 2012, 11:00 am |
Jun Li, Dept. of Statistics, Stanford University |
| February 29, 2012, 3:30 pm |
Cun-Hui Zhang, Rutgers University Dept. of Statistics |
| February 29, 2012, 10:30 am |
Daniel Osborne, Ph.D. candidate, FSU Dept. of Statistics |
| February 28, 2012, 3:30 pm |
Eric Lock, Dept of Statistics, University of North Carolina at Chapel Hill |
| February 27, 2012, 11:00 am |
Kelly McGinnity, FSU Dept. of Statistics |
| February 16, 2012, 2:00 pm |
Alec Kercheval, FSU Dept. of Mathematics |
| February 10, 2012, 3:30 pm |
Jennifer Geis, Ph.D. candidate, FSU Dept. of Statistics |
| February 10, 2012, 10:00 am |
Debdeep Pati |
| February 3, 2012, 10:00 am |
Zhihua Sophia Su |
| January 27, 2012, 10:00 am |
Harry Crane |
| January 20, 2012, 10:00 am |
Anindra Bhadra |
| January 13, 2012, 10:00 am |
Xinge Jessie Jeng |
| January 10, 2012, 3:30 pm |
Ingram Olkin |
| December 13, 2012 |
| Rommel Bain, Department of Statistics, Florida State University, Dissertation Defense |
| Monte Carlo Likelihood Estimation for Conditional Autoregressive Models with Application to Sparse Spatiotemporal Data |
| December 13, 2012 10:00 am |
| OSB 215 |
|
| Spatiotemporal modeling is increasingly used in a diverse array of fields, such as ecology, epidemiology, health care research, transportation, economics, and other areas where data arise from a spatiotemporal process. Spatiotemporal models describe the relationship between observations collected from different spatiotemporal sites. The modeling of spatiotemporal interactions arising from spatiotemporal data is done by incorporating the space-time dependence into the covariance structure. A main goal of spatiotemporal modeling is the estimation and prediction of the underlying process that generates the observations under study and the parameters that govern the process. Furthermore, analysis of the spatiotemporal correlation of variables can be used for estimating values at sites where no measurements exist. In this work, we develop a framework for estimating quantities that are functions of complete spatiotemporal data when the spatiotemporal data is incomplete. We present two classes of conditional autoregressive (CAR) models (the homogeneous CAR (HCAR) model and the weighted CAR (WCAR) model) for the analysis of sparse spatiotemporal data (the log of monthly mean zooplankton biomass) collected on a spatiotemporal lattice by the California Cooperative Oceanic Fisheries Investigations (CalCOFI). These models allow for spatiotemporal dependencies between nearest neighbor sites on the spatiotemporal lattice. Typically, CAR model likelihood inference is quite complicated because of the intractability of the CAR model's normalizing constant. Sparse spatiotemporal data further complicates likelihood inference. We implement Monte Carlo likelihood (MCL) estimation methods for parameter estimation of our HCAR and WCAR models. Monte Carlo likelihood estimation provides an approximation for intractable likelihood functions. We demonstrate our framework by giving estimates for several different quantities that are functions of the complete CalCOFI time series data. |
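The abstract's Monte Carlo likelihood (MCL) step approximates an intractable normalizing constant by importance sampling. The CAR-specific machinery is beyond a short example, but the generic Geyer-style MCL idea can be sketched as follows (an illustrative simplification, not the speaker's implementation; the function names and the Gaussian sanity check are mine):

```python
import numpy as np

def monte_carlo_loglik(theta, psi, x_obs, unnorm_logpdf, sampler, m=200000):
    """Geyer-style Monte Carlo likelihood: the intractable ratio of
    normalizing constants c(theta)/c(psi) is approximated by importance
    sampling with draws from the model at a reference parameter psi."""
    draws = sampler(psi, m)
    log_ratios = unnorm_logpdf(draws, theta) - unnorm_logpdf(draws, psi)
    # log of c(theta)/c(psi) via a log-mean-exp of the importance ratios
    log_const_ratio = np.logaddexp.reduce(log_ratios) - np.log(m)
    return unnorm_logpdf(x_obs, theta) - log_const_ratio

# sanity check on a Gaussian family where the constant is known:
# unnormalized density exp(-theta * x^2), i.e., N(0, 1/(2*theta))
rng = np.random.default_rng(0)
f = lambda x, theta: -theta * np.asarray(x) ** 2
draw = lambda psi, m: rng.normal(0.0, np.sqrt(1.0 / (2.0 * psi)), size=m)
mcl = monte_carlo_loglik(2.0, 1.0, 0.0, f, draw)
```

For this toy family the exact value is 0.5·log(theta/psi) at x_obs = 0, which the Monte Carlo approximation recovers up to sampling error.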
| Back To Top |
| December 3, 2012 |
| Yuanyuan Tang, Department of Statistics, Florida State University, Essay Defense |
| Bayesian Partial Linear Model for Skewed Longitudinal Data |
| December 3, 2012 1:00 pm |
| OSB 215 |
|
| For longitudinal studies with a heavily skewed continuous response, statistical models and methods that focus on the mean response are not appropriate. In this paper, we present a partial linear model for the median regression function of a skewed longitudinal response. We develop a semi-parametric Bayesian estimation procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We provide justifications for our methods, including a theoretical investigation of the support of the prior, asymptotic properties of the posterior, and simulation studies of finite-sample properties. The ease of implementation and the advantages of our model and method over existing methods are illustrated via analysis of a cardiotoxicity study of children of HIV-infected mothers.
Our other aim is to develop Bayesian simultaneous variable selection and estimation for median regression with a skewed response variable. Preliminary simulation studies have been conducted to compare the performance of the proposed model with an existing frequentist median lasso regression model. In terms of estimation bias and total squared error, our proposed model performs as well as, or better than, competing frequentist estimators. |
| Back To Top |
| November 30, 2012 |
| Seungyeon Ha, Department of Statistics, Florida State University, Essay Defense |
| Essay Defense |
| November 30, 2012 2:00 pm |
| 215 OSB |
|
| In this paper, L0 regularization is proposed for estimating a sparse linear regression vector in a high-dimensional setting, for the purposes of both prediction and variable selection. The oracle upper bounds on both the prediction error and the selection error are of the same rate as those via the Lasso, even with no restriction on the design matrix. The estimation loss in the Lq-norm, where q ∈ [1, ∞], is upper bounded at the optimal rate of O((log K)^(q/2)) under a less restrictive condition, RIF, proposed by Zhang and Zhang (2011). Sparsity recovery, or variable selection, is our main concern, and we derive the conditions required for sign consistency, which control the incoherence of the design matrix and the signal-to-noise ratio (SNR). L0 regularization achieves an SNR of the optimal rate but requires less restrictive conditions than the Lasso does for achieving that rate. We then extend our theorems to the multivariate response model by considering grouping on the univariate model. On both models we use the hard-TISP algorithm proposed by She (2009), and we guarantee the same stationary points by scaling the design matrix properly. |
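The hard-TISP iteration referred to above alternates a gradient step with hard thresholding on a design rescaled to spectral norm at most one. A minimal sketch (illustrative only, not the defended implementation; the threshold value and the toy data are my choices, and the threshold acts on the rescaled problem):

```python
import numpy as np

def hard_threshold(z, lam):
    # keep entries with |z_j| > lam, zero out the rest
    out = z.copy()
    out[np.abs(out) <= lam] = 0.0
    return out

def tisp_hard(X, y, lam, n_iter=500):
    """Hard-TISP sketch: gradient step + hard thresholding on a design
    rescaled to spectral norm <= 1, which stabilizes the iteration."""
    scale = np.linalg.norm(X, 2)        # largest singular value
    Xs = X / scale
    b = np.zeros(X.shape[1])            # coefficients of the scaled design
    for _ in range(n_iter):
        b = hard_threshold(b + Xs.T @ (y - Xs @ b), lam)
    return b / scale                    # map back to the original design

# toy sparse recovery: 3 true signals among 20 predictors
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [4.0, -3.0, 2.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = tisp_hard(X, y, lam=5.0)
```

With a well-separated signal like this, the iteration settles on the true support and near-unbiased coefficient estimates.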
| Back To Top |
| November 30, 2012 |
| Yiyuan She, Department of Statistics, Florida State University |
| On the Cross-Validation for Sparse Reduced Rank Models |
| November 30, 2012 10:00 am |
| 108 OSB |
|
| Recently, the availability of high-dimensional data in statistical applications has created an urgent need for methodologies that pursue sparse and/or low-rank models. These approaches usually resort to a grid search with a model comparison criterion to locate the optimal value of the regularization parameter. Cross-validation is one of the most widely used tuning methods in statistics and computer science. We propose a new form of cross-validation, referred to as selective-projective cross-validation (SPCV), for multivariate models where the relevant features may be few and/or lie in a low-dimensional subspace. In contrast to most available methods, SPCV cross-validates candidate projection-selection patterns instead of regularization parameters and is not limited to specific penalties. A further scale-free complexity correction is developed, based on the nonasymptotic Predictive Information Criterion (PIC), to achieve the minimax optimal error rate in this setting. |
| Back To Top |
| November 16, 2012 |
| Jiashun Jin, Department of Statistics, Carnegie Mellon University |
| Fast Network Community Detection by SCORE |
| November 16, 2012 10:00 am |
| 108 OSB |
|
| Consider a network where the nodes split into K different communities. The
community labels of the nodes are unknown, and it is of major interest to estimate
them (i.e., community detection). The Degree Corrected Block Model (DCBM) is a popular
network model. How to detect communities under the DCBM is an interesting problem,
where the main challenge lies in the degree heterogeneity.
We propose Spectral Clustering On Ratios-of-Eigenvectors (SCORE) as a new
approach to community detection. Compared to classical spectral methods, the main
innovation is to use the entry-wise ratios between the first leading eigenvector and
each of the other leading eigenvectors. Let X be the adjacency matrix of the network.
We first obtain the K leading eigenvectors, say η̂_1, …, η̂_K, and let R̂ be the n × (K − 1)
matrix such that R̂(i, k) = η̂_{k+1}(i) / η̂_1(i), 1 ≤ i ≤ n, 1 ≤ k ≤ K − 1. We then use R̂
for clustering by applying the k-means method.
The central surprise is that the effect of degree heterogeneity is largely ancillary, and
can be effectively removed by taking entry-wise ratios between η̂_{k+1} and η̂_1, 1 ≤ k ≤ K − 1.
The method is successfully applied to the web blogs data and the karate club
data, with error rates of 58/1222 and 1/34, respectively. These results are much
more satisfactory than those of the classical spectral methods. Also, compared to
modularity methods, SCORE is computationally much faster and has smaller error
rates.
We develop a theoretical framework in which we show that, under mild conditions,
SCORE stably yields successful community detection. At the core of the analysis are
recent developments in Random Matrix Theory (RMT), where the matrix-form
Bernstein inequality is especially helpful. |
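The SCORE recipe in the abstract (leading eigenvectors, entry-wise ratios, then k-means) can be sketched directly. This is a minimal illustration, not the speaker's code; the toy degree-corrected block model used to exercise it is mine:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def score(A, K, seed=0):
    """SCORE sketch: cluster the entry-wise ratios of the K leading
    eigenvectors of the adjacency matrix A (K = number of communities)."""
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(vals))[:K]   # K leading eigenvectors by |eigenvalue|
    eta = vecs[:, order]
    R = eta[:, 1:] / eta[:, [0]]            # n x (K-1) ratio matrix R-hat
    np.random.seed(seed)                    # kmeans2 draws from NumPy's global RNG
    _, labels = kmeans2(R, K, minit='++')
    return labels

# toy DCBM-style expected adjacency (no noise): 2 communities,
# heterogeneous degree parameters theta
theta = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
comm = [0, 0, 0, 1, 1, 1]
B = np.array([[1.0, 0.2], [0.2, 1.0]])
A = np.outer(theta, theta) * B[np.ix_(comm, comm)]
labels = score(A, 2)
```

Note how the ratio step cancels the degree parameters theta: within each community the rows of R̂ are identical, so k-means separates the communities exactly despite the heterogeneous degrees.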
| Back To Top |
| November 9, 2012 |
| Ming Yuan, School of Industrial &amp; Systems Engineering, Georgia Tech |
| Adaptive Estimation of Large Covariance Matrices |
| November 9, 2012 10:00 am |
| 108 OSB |
|
| Estimation of large covariance matrices has drawn considerable recent attention and the theoretical focus so far is mainly on developing a minimax theory over a fixed parameter space. In this talk, I shall discuss adaptive covariance matrix estimation where the goal is to construct a single procedure which is minimax rate optimal simultaneously over each parameter space in a large collection. The estimator is constructed by carefully dividing the sample covariance matrix into blocks and then simultaneously estimating the entries in a block by thresholding. I shall also illustrate the use of the technical tools developed in other matrix estimation problems.
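The adaptive block-thresholding construction described above is involved; as a simpler flavor of the same idea, here is an entry-wise hard-thresholding sketch for a sparse covariance matrix (an illustrative simplification, not the speaker's procedure; the constant c and the toy identity-covariance check are mine):

```python
import numpy as np

def threshold_covariance(X, c=2.0):
    """Entry-wise hard thresholding of the sample covariance:
    off-diagonal entries below c * sqrt(log(p) / n) are zeroed."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    lam = c * np.sqrt(np.log(p) / n)
    T = np.where(np.abs(S) >= lam, S, 0.0)
    np.fill_diagonal(T, np.diag(S))   # never threshold the variances
    return T

# with true covariance = identity, almost all off-diagonal noise is removed
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 20))
T = threshold_covariance(X)
```

The sqrt(log p / n) threshold matches the typical size of a null sample-covariance entry; the block approach in the talk replaces this single entry-wise rule with block-level decisions to achieve adaptivity across parameter spaces.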
|
| Back To Top |
| November 7, 2012 |
| David Bristol, Statistical Consulting Services, Inc. |
| Two Adaptive Procedures for Comparing Two Doses to Placebo Using Conditional Power |
| November 7, 2012 3:35 pm |
| 108 OSB |
|
| Adaptive designs have received much attention recently for various goals, including sample size re-estimation and dose selection. Here two adaptive procedures for comparing two doses of an active treatment to placebo with respect to a binomial response variable using a double-blind randomized clinical trial are presented. The goals of the interim analysis are to stop for futility or to continue with one dose or both doses, and placebo, with a possible increase in the sample size for any group that continues. Various properties of the two procedures, which are both based on the concept of conditional power, are presented.
|
| Back To Top |
| November 2, 2012 |
| Jinfeng Zhang, Department of Statistics, FSU |
| Change-point detection for high-throughput genomic data |
| November 2, 2012 10:00 am |
| 108 OSB |
|
|
Analysis of high-throughput genomic data often requires the detection of
change-points along a genome. For example, when comparing the
chromatin accessibility of two samples (e.g., normal and cancer cells),
an essential task is to detect both the locations and the lengths
of genomic regions that show statistically significant differences in
chromatin accessibility between the two samples. Similar tasks are
encountered when comparing DNA copy number variations, nucleosome
occupancy, DNA methylation, and histone modifications of two or
more samples. In these experiments, genetic or epigenetic features
are measured along the genome at thousands or millions of genomic
locations. Given two different conditions, many genomic regions can
undergo significant changes. Accurate detection of the changes will
help scientists understand the biological mechanisms responsible
for the phenotype differences of the samples being compared. This
problem falls into a more general class of statistical problems, called
change-point problems, which have been actively studied by scientists in
a variety of disciplines over the past couple of decades. However, many
of the existing methods are not suitable for analyzing high-throughput
genomic data. In this talk, I present two related change-point
problems and our solutions to them. We manually annotated a benchmark
dataset and used it to rigorously compare our method to several
popular methods in the literature. Our method was shown to perform better
than the previous methods on the benchmark dataset. We further applied
the method to study the effect of drug treatments on chromatin
accessibility and nucleosome occupancy using HDAC inhibitors, a class
of drugs for cancer treatment. |
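As background on the change-point problem the abstract introduces, the classical single change-point detector maximizes a standardized CUSUM mean-difference statistic over candidate split points. A minimal sketch (a textbook baseline, not the speakers' genomic method; the step signal is mine):

```python
import numpy as np

def cusum_changepoint(x):
    """Return the split point k (1 <= k < n) maximizing the standardized
    CUSUM mean-difference statistic, for a single change in mean."""
    x = np.asarray(x, float)
    n = x.size
    csum = np.cumsum(x)
    total = csum[-1]
    best_k, best_stat = 1, -np.inf
    for k in range(1, n):
        left_mean = csum[k - 1] / k
        right_mean = (total - csum[k - 1]) / (n - k)
        # weight balances the variances of the two segment means
        stat = abs(left_mean - right_mean) * np.sqrt(k * (n - k) / n)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat

# a step from 0 to 1 at position 50 is recovered exactly
k_hat, stat = cusum_changepoint(np.concatenate([np.zeros(50), np.ones(50)]))
```

Genomic applications layer multiple change-points, unknown region lengths, and multiple-testing control on top of this basic idea, which is where the methods of the talk depart from the baseline.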
| Back To Top |
| October 30, 2012 |
| Steve Chung, Ph.D. Candidate |
| Essay Defense: A Class of Nonparametric Volatility Models: Applications to Financial Time Series |
| October 30, 2012 10:00 am |
| 499 DSL |
|
| Over the past few decades, financial volatility modeling has been a very active and extensive research area for academics and practitioners, and it remains one of the main ongoing research areas in empirical finance and time series econometrics. We first examine several parametric and nonparametric volatility models in the literature. Some of the popular parametric models include the generalized autoregressive conditional heteroscedastic (GARCH), exponential GARCH (EGARCH), and threshold GARCH (TGARCH) models. However, these models rely on explicit functional-form assumptions, which can lead to model misspecification. Nonparametric models, on the other hand, are free from such functional-form assumptions and offer greater model flexibility. In this talk, we show how to estimate financial volatility using multivariate adaptive regression splines (MARS) as a preliminary analysis toward building a nonparametric volatility model. Despite its popularity, MARS has never been applied to model financial volatility. To implement the MARS methodology in a time series setting, we let the predictor variables be lagged values, which results in a model referred to as adaptive spline threshold autoregression (ASTAR). The estimation is illustrated through simulations and empirical examples using historical stock data and exchange rate data. We compare the performance of the MARS volatility model with existing models using several out-of-sample goodness-of-fit measures. |
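For readers unfamiliar with the parametric baseline, a GARCH(1,1) model specifies the conditional variance recursively as sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}. A minimal simulation sketch (the parameter values are arbitrary choices of mine, picked so the unconditional variance omega / (1 - alpha - beta) equals 1):

```python
import numpy as np

def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate returns eps_t with conditional variance
    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}."""
    rng = np.random.default_rng(seed)
    sigma2 = np.empty(n)
    eps = np.empty(n)
    sigma2[0] = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps, sigma2

eps, sigma2 = simulate_garch11(20000)
```

The explicit recursion is exactly the functional-form assumption the abstract contrasts with the nonparametric MARS/ASTAR approach, which instead learns the volatility response from lagged values.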
| Back To Top |
| October 29, 2012 |
| Emilola Abayomi, Ph.D. Candidate, Dissertation Defense |
| The Relationship between Body Mass and Blood Pressure in Diverse Populations |
| October 29, 2012 12:00 pm |
| OSB 215 |
|
| High blood pressure is a major determinant of risk for Coronary Heart Disease (CHD) and stroke, leading causes of death in the industrialized world. A myriad of pharmacological treatments for elevated blood pressure, defined as a blood pressure greater than 140/90 mmHg, are available and have at least partially resulted in large reductions in the incidence of CHD and stroke in the U.S. over the last 50 years. The factors that may increase blood pressure levels are not well understood, but body mass is thought to be a major determinant of blood pressure level. Obesity is measured through various methods (skinfolds, waist-to-hip ratio, bioelectrical impedance analysis (BIA), etc.), but the most commonly used measure is body mass index, BMI = weight (kg) / height (m)².
The relationship between blood pressure level and BMI has been perceived to be linear and strong. This thesis examined the relationship of blood pressure and BMI among diverse populations. The Diverse Populations Collaboration dataset comprises almost 30 observational studies from around the world. We conducted a meta-analysis to explore heterogeneity that may be present in this relationship across diverse populations. Where heterogeneity was present, a meta-regression was conducted to determine whether characteristics such as race and gender explain the differences among studies. We also examined the functional form of BMI and blood pressure to determine whether a linear assumption is acceptable when modeling the relationship in all populations.
|
| Back To Top |
| October 26, 2012 |
| Ciprian Crainiceanu, Department of Biostatistics, Johns Hopkins University |
| Longitudinal analysis of high resolution structural brain images |
| October 26, 2012 10:00 am |
| 108 OSB |
|
| The talk will provide a gentle introduction to brain imaging and describe the problems associated with the longitudinal analysis of ultra-high-dimensional 3D brain images. In particular, I will describe the work we have done to understand and characterize the microstructure of white matter brain tracts, as well as lesion occurrence and development, in a large cohort of subjects who suffer from multiple sclerosis. The statistical methods developed are in response to real scientific problems from our first-line collaborations with our colleagues from NIH and Johns Hopkins School of Medicine. For more information about the speaker: www.biostat.jhsph.edu/~ccrainic. For more information about the research group: www.smart-stats.org. |
| Back To Top |
| October 19, 2012 |
| Gareth James, Marshall School of Business, University of Southern California |
| Functional Response Additive Model Estimation |
| October 19, 2012 10:00 am |
| 108 OSB |
|
| While functional regression models have received increasing attention
recently, most existing approaches assume both a linear relationship and a
scalar response variable. We suggest a new method, "Functional Response
Additive Model Estimation" (FRAME), which extends the usual linear
regression model to situations involving both functional predictors, X(t),
and functional responses, Y(t). Our approach uses a penalized least squares
optimization criterion to automatically perform variable selection in
situations involving multiple functional predictors. In addition, our method
uses an efficient coordinate descent algorithm to fit general non-linear
additive relationships between the predictors and response. We apply our
model to the context of forecasting product demand in the entertainment
industry. In particular, we model the decay rate of demand for Hollywood
movies using the predictive power of online virtual stock markets (VSMs).
VSMs are online communities that, in a market-like fashion, gather the
crowds' opinion about a particular product. Our fully functional model
captures the pattern of pre-release VSM trading values and provides superior
predictive accuracy of a movie's demand distribution in comparison to
traditional methods. In addition, we propose graphical tools which give a
glimpse into the causal relationship between market behavior and box office
revenue patterns and hence provide valuable insight to movie decision
makers. |
| Back To Top |
| October 12, 2012 |
| Michelle Arbeitman, College of Medicine, FSU |
| Genes to Behavior: Genomic analyses of sex-specific behaviors |
| October 12, 2012 10:00 am |
| 108 OSB |
|
| My lab is interested in understanding the molecular-genetic basis of complex behaviors. We use the model system Drosophila melanogaster (fruit flies) to address our questions. Drosophila is an ideal model for studying behavior, as there are powerful tools for molecular-genetic studies, and male and female flies display complex reproductive behaviors that are genetically specified by one of the best-characterized genetic regulatory hierarchies. My talk will introduce next-generation sequencing technologies and some of the computational and statistical challenges in analyzing these data sets. I will also present some of our experimental results on Drosophila sex-specific biology that were obtained utilizing next-generation sequencing platforms. |
| Back To Top |
| October 5, 2012 |
| Adrian Barbu, Dept. of Statistics, FSU |
| Feature Selection by Scheduled Elimination |
| October 5, 2012 10:00 am |
| 108 OSB |
|
| Many computer vision and medical imaging problems are faced with learning classifiers from large datasets, with millions of observations and features. In this work we propose a novel efficient algorithm for variable selection and learning on such datasets, optimizing a constrained penalized likelihood without any sparsity-inducing priors. The iterative suboptimal algorithm alternates parameter updates with tightening the constraints by gradually removing variables based on a criterion and a schedule. We present a generic approach applicable to any differentiable loss function and give an application to logistic regression. We use one-dimensional piecewise-linear response functions for nonlinearity and introduce a second-order prior on the response functions to avoid overfitting. Experiments on real and synthetic data show that the proposed method usually outperforms Logitboost and L1-penalized methods for both variable selection and prediction while being computationally faster.
|
| Back To Top |
| September 28, 2012 |
| Vladimir Koltchinskii, Dept. of Mathematics, Georgia Tech |
| Complexity Penalization in Low Rank Matrix Recovery |
| September 28, 2012 10:00 am |
| 108 OSB |
|
| The problem of estimation of a large Hermitian matrix based on random linear measurements will be discussed. Such problems have been intensively studied in recent years in cases where the target matrix has relatively small rank, or can be well approximated by small-rank matrices. Important examples include matrix completion, where a random sample of the entries of the target matrix is observed, and quantum state tomography, where the target matrix is the density matrix of a quantum system and has to be estimated from measurements of a finite number of randomly picked observables. We will consider several approaches to such problems based on a penalized least squares method (and its modifications) with complexity penalties defined in terms of the nuclear norm, von Neumann entropy, and other functionals that “promote” small-rank solutions, and discuss oracle inequalities for the resulting estimators with explicit dependence of the error terms on the rank and other parameters of the problem. We will also discuss a version of these methods in which the target matrix is a “smooth” low-rank kernel defined on a large graph and the goal is to design estimators that are adaptive simultaneously to the rank of the kernel and to its degree of smoothness.
|
| Back To Top |
| September 21, 2012 |
| Xiaotong Shen, John Black Johnston Distinguished Professor, School of Statistics, University of Minnesota |
| On personalized information filtering |
| September 21, 2012 10:00 am |
| 108 OSB |
|
| Personalized information filtering extracts the information specifically
relevant to a user, based on the opinions of users who think alike or the
content of the items that a specific user prefers. In this presentation,
we discuss latent models to utilize additional user-specific and
content-specific predictors, for personalized prediction. In particular, we factorize a user-over-item preference matrix into a product of two matrices,
each having the same rank as the original matrix. On this basis, we seek a sparsest latent factorization from a class of overcomplete factorizations, possibly with a high percentage of missing values. A likelihood approach is discussed, with an emphasis on scalable computation. Examples will be given to contrast with popular methods for collaborative filtering and content-based filtering.
This work is joint with Changqing Ye and Yunzhang Zhu.
|
| Back To Top |
| September 14, 2012 |
| Xiuwen Liu, FSU Dept. of Computer Science |
| Quantitative Models for Nucleosome Occupancy Prediction |
| September 14, 2012 10:00 am |
| 108 OSB |
|
|
The nucleosome is the basic packaging unit of DNA in eukaryotic cells. As nucleosomes limit the accessibility of the wrapped DNA to transcription factors and other DNA-binding proteins, their positions play an essential role in the regulation of gene activity. Experiments have indicated that the DNA sequence itself strongly influences nucleosome positioning by enhancing or reducing its binding affinity to nucleosomes, thereby providing an intrinsic cell regulatory mechanism. In this talk I will present quantitative models that I have developed for nucleosome occupancy prediction with Prof. Jonathan Dennis and my students. In particular, I will focus on two models we have proposed recently. The first is a new dinucleotide matching model, in which we propose a new feature set for nucleosome occupancy prediction and learn the parameters via regression; evaluation using a genome-wide dataset shows that our model gives more accurate predictions than existing models. The second is a new algorithm that achieves single-basepair resolution in localizing nucleosomes by posing the genome-wide localization problem as a classification problem using chemical-mapping datasets.
Short Bio: Xiuwen Liu received his Ph.D. from the Ohio State University in 1999 in Computer and Information Science and joined the Department of Computer Science at Florida State University in 2000, where he is a full professor. His recent research interests include computational models for biology, image analysis, machine learning, computer security, and manifold-based modeling for security in cyber-physical systems.
|
| Back To Top |
| August 9, 2012 |
| Senthil Girimurugan |
| Detecting Differences in Signals via Reduced-Dimension Wavelets |
| August 9, 2012 11:00 am |
| OSB 215 |
|
| All processes in engineering and other fields of science either produce a signal as an output or contain an underlying signal that describes the process. A process can be understood in detail by analyzing the associated signal efficiently. In statistical quality control, such an analysis is carried out by monitoring profiles (signals) and detecting differences between an in-control (IC) and an out-of-control (OOC) signal. The dimensions of profiles have increased tremendously with recent advancements in technology, resulting in increased complexity of analysis. In this work, we explore several methods for detecting signal differences by reducing dimension using wavelets. The methodology involves the well-known Hotelling T² statistic improved by wavelets. In the current work, a statistical power analysis is conducted to determine the efficiency of this statistic in detecting local and global differences, laying the foundation for a wavelet-based ANOVA setup involving the proposed statistic. As an application, the proposed methodology is applied to detect differences in genetic data. |
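The core comparison step in the abstract is a two-sample Hotelling T² test on dimension-reduced profiles. A minimal sketch of that statistic (the wavelet reduction step is assumed already done, so the inputs are low-dimensional coefficient vectors; the function name and toy data are mine):

```python
import numpy as np

def hotelling_t2(X1, X2):
    """Two-sample Hotelling T^2 on feature vectors (rows), e.g. the
    leading wavelet coefficients retained after dimension reduction."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    n1, n2 = X1.shape[0], X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # pooled sample covariance of the two groups
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(S, diff)

X1 = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
t2_same = hotelling_t2(X1, X1)                 # identical samples: statistic is 0
t2_shift = hotelling_t2(X1, X1 + [10.0, 0.0])  # shifted samples: large statistic
```

Reducing each profile to a handful of wavelet coefficients first is what keeps the pooled covariance S well-conditioned when the raw profiles are high-dimensional.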
| Back To Top |
| May 4, 2012 |
| Jingyong Su, FSU Dept. of Statistics |
| Estimation, Analysis and Modeling of Random Trajectories on Nonlinear Manifolds |
| May 4, 2012 10:00 am |
| OSB 215 |
|
| A growing number of datasets now contain both a spatial and a temporal dimension. Trajectories are natural spatiotemporal data descriptors. Estimation, analysis, and modeling of such trajectories are thus becoming increasingly important in many applications, ranging from computer vision to medical imaging. Many problems in these areas are naturally posed as problems on nonlinear manifolds, because intrinsic constraints on the pertinent features force the corresponding representations onto these manifolds. There are many difficulties in estimating and analyzing random trajectories on nonlinear manifolds. First, most standard techniques on Euclidean spaces cannot be directly extended to nonlinear manifolds. Furthermore, such trajectories are typically noisy and arbitrarily parameterized. In this work, we begin by estimating full paths on common nonlinear manifolds using only a set of time-indexed points, for use in interpolation, smoothing, and prediction of dynamic systems. Next, we address the problem of registration and comparison of such temporal trajectories. In future work, we will focus on modeling random trajectories on nonlinear manifolds. |
| Back To Top |
| April 27, 2012 |
| Ester Kim, FSU Dept. of Statistics |
| An Ensemble Approach to Predict the Risk of Coronary and Cardiovascular Disease |
| April 27, 2012 3:30 pm |
| OSB 215 |
|
| Coronary and cardiovascular diseases continue to be the leading cause of mortality in the United States and across the globe. They are also estimated to have the highest medical expenditures in the United States among chronic diseases. Early detection of the development of heart disease plays a critical role in preserving heart health, and accurate prediction provides highly valuable information for early treatment. For the past few decades, estimates of coronary or cardiovascular risks have been based on logistic regression or Cox proportional hazards models. In more recent years, machine learning models have grown in popularity within the medical field, but few have been applied to disease prediction, particularly for coronary or cardiovascular risks.
We first evaluate the predictive performance of two machine learning models, the multilayer perceptron network and the k-nearest neighbor classifier, against the statistical models, logistic regression and Cox proportional hazards. Our aim is to combine these predictive models into one model in an ensemble approach for superior classification performance.
The ensemble approaches include bagging, a bootstrap-aggregating model, and a multimodel ensemble, which is a combination of independently constructed models. The ensemble models are also evaluated for predictive performance relative to the single models. Various measures and methods are used to evaluate the models' performances based on the Framingham Heart Study data.
|
| Back To Top |
| April 27, 2012 |
| Sebastian Kurtek, Ph.D Candidate, Dissertation |
| Riemannian Shape Analysis of Curves and Surfaces |
| April 27, 2012 10:00 am |
| |
|
| Shape analysis of curves and surfaces is a very important tool in many applications ranging from computer vision to bioinformatics and medical imaging. There are many difficulties when analyzing shapes of parameterized curves and surfaces. Firstly, it is important to develop representations and metrics such that the analysis is invariant to parameterization in addition to the standard transformations (rigid motion and scaling). Furthermore, under the chosen representations and metrics, the analysis must be performed on infinite-dimensional and sometimes non-linear spaces, which poses an additional difficulty. In this work, we develop and apply methods that address these issues. We begin by defining a framework for shape analysis of parameterized open curves and extend these ideas to shape analysis of surfaces. We utilize the presented frameworks in various classification experiments spanning multiple application areas. In the case of curves, we consider the problem of clustering DT-MRI brain fibers, classification of protein backbones, modeling and segmentation of signatures and statistical analysis of biosignals. In the case of surfaces, we perform disease classification using 3D anatomical structures in the brain, classification of handwritten digits by viewing images as quadrilateral surfaces, and finally classification of cropped facial surfaces. We provide two additional extensions of the general shape analysis frameworks that are the focus of this thesis. The first one considers shape analysis of marked spherical surfaces where, in addition to the surface information, we are given a set of manually or automatically generated landmarks. This requires additional constraints on the definition of the re-parameterization group and is applicable in many domains, especially medical imaging and graphics. Second, we consider reflection symmetry analysis of planar closed curves and spherical surfaces. 
Here, we also provide an example of disease detection based on brain asymmetry measures. We close with a brief summary and a discussion of open problems, which we plan on exploring in the future. |
| Back To Top |
| April 20, 2012 |
| Sunil Rao, University of Miami |
| Best Predictive Estimation for Linear Mixed Models with Applications to Small Area Estimation |
| April 20, 2012 10:00 am |
| OSB 110 |
|
| We derive the best predictive estimator (BPE) of the fixed parameters
for a linear mixed model. This leads to a new prediction procedure
called observed best prediction (OBP), which
is different from the empirical best linear unbiased prediction
(EBLUP). We show that BPE is more reasonable than the traditional
estimators derived from estimation considerations, such as maximum
likelihood (ML) and restricted maximum likelihood (REML), if the main
interest is the prediction of the mixed effect. We show how the OBP
can significantly outperform the EBLUP in terms of mean
squared prediction error (MSPE) if the underlying model is
misspecified. On the other hand, when the underlying model is
correctly specified, the overall predictive performance of the OBP
can be very similar to the EBLUP. The well known Fay-Herriot small area
model is used as an illustration of the methodology. In addition,
simulations and analysis of a data set on graft failure rates
from kidney transplant operations will be used to show empirical
performance. This is joint work with Jiming Jiang of UC-Davis and
Thuan Nguyen of Oregon Health and Science University. |
| Back To Top |
| April 13, 2012 |
| Gretchen Rivera, FSU Dept. of Statistics |
| Meta Analysis of Measures of Discrimination and Prognostic Modeling |
| April 13, 2012 10:00 am |
| OSB 110 |
|
| In this paper we are interested in predicting death with the underlying cause of coronary heart disease (CHD). Two prognostic modeling methods are commonly used to predict CHD: the logistic model and the proportional hazards model. In this paper, the logistic model is used.
The dataset used is the Diverse Populations Collaboration (DPC) dataset, which includes 28 studies. The DPC dataset has epidemiological results from investigations conducted in different populations around the world. For our analysis we include individuals who are 17 years old or older. Our predictors are: age, diabetes, total serum cholesterol (mg/dl), systolic blood pressure (mmHg) and current cigarette smoking status. There is a natural grouping within the studies, such as gender, rural or urban area, and race. Based on these strata we have 70 cohort groups.
Our main interest is to evaluate how well the prognostic model discriminates. For this, we use the area under the Receiver Operating Characteristic (ROC) curve. The idea behind the ROC curve is that a set of subjects is known to belong to one of two classes (a signal group or a noise group), and an assignment procedure assigns each subject to a class on the basis of observed information. The assignment procedure is not perfect: sometimes a subject is misclassified. To evaluate the quality of this procedure, we use the area under the ROC curve (AUC), which varies from 0.5 (no apparent accuracy) to 1.0 (perfect accuracy). For each logistic model we compute the AUC and its standard error (SE); given the association between the AUC and the Wilcoxon statistic, we use the Wilcoxon statistic to estimate the SE. We use meta-analysis to find the overall AUC and to evaluate whether there is heterogeneity in our estimates, using the Q statistic to measure the extent of heterogeneity. Since heterogeneity was found in our study, we compare seven different methods for estimating the between-study variance.
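The AUC computation and its Wilcoxon-based standard error can be sketched as follows (an illustrative Python fragment using the Hanley-McNeil variance formula, which exploits the AUC-Wilcoxon link; this is not the author's code):

```python
import numpy as np

def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the Mann-Whitney/Wilcoxon probability P(score_pos > score_neg),
    counting ties as 1/2."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC (Hanley & McNeil, 1982), derived
    from the relation between the AUC and the Wilcoxon statistic."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc) + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return np.sqrt(var)
```

The per-cohort (AUC, SE) pairs are then the inputs to the fixed- or random-effects meta-analysis described above.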
|
| Back To Top |
| April 6, 2012 |
| Xu Han, University of Florida |
| False Discovery Control Under Arbitrary Dependence |
| April 6, 2012 10:00 am |
| OSB 110 |
|
| Multiple hypothesis testing is a fundamental problem in high dimensional inference, with wide
applications in many scientific fields. In genome-wide association studies, tens of thousands of
hypotheses are tested simultaneously to find if any genes are associated with some traits; in finance,
thousands of tests are performed to see which fund managers have winning ability. In practice,
these tests are correlated. False discovery control under arbitrary covariance dependence is a very
challenging and important open problem in modern research.
We propose a new methodology based on principal factor approximation, which successfully extracts the common dependence and significantly weakens the correlation structure, to deal with an
arbitrary dependence structure. We derive the theoretical distribution for false discovery proportion
(FDP) in large scale multiple testing when a common threshold is used for rejection, and provide
a consistent estimate of FDP. Specifically, we decompose the test statistics into an approximate
multifactor model with weakly dependent errors, derive the factor loadings and estimate the
unobserved but realized factors, which account for the dependence, by L1-regression. Asymptotic theory
is derived to justify the consistency of our proposed method. This result has important applications
in controlling FDR and FDP. The finite sample performance of our procedure is critically evaluated by various simulation studies. Our estimate of FDP compares favorably with Efron's (2007) approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data in genome-wide association studies.
This is joint work with Professor Jianqing Fan and Mr. Weijie Gu at Princeton University.
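A very rough sketch of the two ingredients, removing leading factors from a matrix of test statistics and estimating the FDP at a common rejection threshold under N(0,1) nulls, is given below (illustrative only; the actual method estimates the realized factors by L1-regression rather than by SVD, and all names here are hypothetical):

```python
import numpy as np
from math import erfc, sqrt

def factor_adjust(Z, k):
    """Remove the k leading principal factors from a matrix of test
    statistics Z (replications x tests), weakening the common dependence.
    A rough SVD stand-in for the principal factor approximation."""
    Zc = Z - Z.mean(axis=0)
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    s[:k] = 0.0                      # drop the top-k factor directions
    return U @ np.diag(s) @ Vt + Z.mean(axis=0)

def fdp_estimate(z, t):
    """Estimated false discovery proportion when all tests share a common
    two-sided threshold t, assuming N(0,1) null statistics."""
    m = len(z)
    R = max(int((np.abs(z) > t).sum()), 1)   # number of rejections
    p0_tail = erfc(t / sqrt(2.0))            # two-sided null tail, 2*Phi(-t)
    return min(m * p0_tail / R, 1.0)
```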
|
| Back To Top |
| March 30, 2012 |
| Jordan Cuevas, Ph.D Candidate, Dissertation |
| Estimation and Sequential Monitoring of Nonlinear Functional Responses Using Wavelet Shrinkage |
| March 30, 2012 2:00 pm |
| OSB 108 |
|
| Statistical process control (SPC) is widely used in industrial settings to monitor processes for shifts in their distributions. SPC is generally thought of in two distinct phases: Phase I, in which historical data is analyzed in order to establish an in-control process, and Phase II, in which new data is monitored for deviations from the in-control form.
| Traditionally, SPC has been used to monitor univariate (multivariate) processes for changes in a particular parameter (parameter vector). Recently, however, technological advances have produced processes in which each observation is an n-dimensional functional response (referred to as a profile), where n can be quite large. Additionally, these profiles often cannot be adequately represented parametrically, making traditional SPC techniques inapplicable.
This dissertation starts out by addressing the problem of nonparametric function estimation, which would be used to analyze process data in a Phase-I setting. The translation invariant wavelet estimator (TI) is often used to estimate irregular functions, despite the drawback that it tends to oversmooth jumps. A trimmed translation invariant estimator (TTI) is proposed, of which the TI estimator is a special case. By reducing the point by point variability of the TI estimator, TTI is shown to retain the desirable qualities of TI while improving reconstructions of functions with jumps.
Attention is then turned to the Phase-II problem of monitoring sequences of profiles for deviations from in-control. Two profile monitoring schemes are proposed; the first monitors for changes in the noise variance using a likelihood ratio test based on the highest detail level of wavelet coefficients of the observed profile. The second offers a semiparametric test to monitor for changes in both the functional form and noise variance. Both methods make use of wavelet shrinkage in order to distinguish relevant functional information from noise contamination. Different forms of each of these test statistics are proposed and results are compared via Monte Carlo simulation.
|
| Back To Top |
| March 30, 2012 |
| Jinfeng Zhang, FSU Dept. of Statistics |
| Statistical approaches for protein structure comparison and their applications in protein function prediction |
| March 30, 2012 10:00 am |
| OSB 110 |
|
| Comparison of protein structures is important for revealing the
evolutionary relationship among proteins, predicting protein functions
and predicting protein structures. Many methods have been developed in
the past to align two or multiple protein structures. Despite the
importance of this problem, rigorous mathematical or statistical
frameworks have seldom been pursued for general protein structure
comparison. One notable issue in this field is that with many
different distances used to measure the similarity between protein
structures, none of them are proper distances when protein structures
of different sequences are compared. Statistical approaches based on
those non-proper distances or similarity scores as random variables
are thus not mathematically rigorous. In this work, we develop a
mathematical framework for protein structure comparison by treating
protein structures as three-dimensional curves. Using an elastic
Riemannian metric on spaces of curves, geodesic distance, a proper
distance on spaces of curves, can be computed for any two protein
structures. In this framework, protein structures can be treated as
random variables on the shape manifold, and means and covariance can
be computed for populations of protein structures. Furthermore, these
moments can be used to build Gaussian-type probability distributions
of protein structures for use in hypothesis testing. Our method
performs comparably with commonly used methods in protein structure
classification, but with a much improved speed. Some recent results on
comparison of protein surfaces will also be presented. |
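The elastic framework can be illustrated with the square-root velocity function (SRVF): after mapping curves to their SRVFs and normalizing scale, the great-circle distance on the unit sphere is a proper distance. A minimal Python sketch (ignoring the rotation and reparameterization alignment that a full protein-structure comparison requires):

```python
import numpy as np

def srvf(curve):
    """Square-root velocity representation q = f' / sqrt(|f'|) of a
    discretized curve (n points x d dimensions)."""
    v = np.gradient(curve, axis=0)
    speed = np.linalg.norm(v, axis=1)
    speed = np.where(speed < 1e-12, 1e-12, speed)  # guard degenerate points
    return v / np.sqrt(speed)[:, None]

def geodesic_distance(c1, c2):
    """Great-circle distance between unit-norm SRVFs: a proper distance
    on the space of curves modulo scale."""
    q1, q2 = srvf(c1), srvf(c2)
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    inner = np.clip(np.sum(q1 * q2), -1.0, 1.0)
    return np.arccos(inner)
```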
| Back To Top |
| March 29, 2012 |
| Paul Hill |
| Bootstrap Prediction Bands for Non-Parametric Function Signals in a Complex System |
| March 29, 2012 2:00 pm |
| BEL 243 |
|
| Methods employed in the construction of prediction bands for continuous curves require
a different approach from those used for a single data point. In many cases, the underlying function is unknown, and thus a distribution-free approach that preserves sufficient coverage for the signal in its entirety is necessary. Four methods for the formation of (1-α)100% prediction and containment bands are presented, and their performances are compared through the coverage probabilities obtained.
These techniques are applied successfully to constructing prediction bands for spring discharge, giving good coverage in each case. Spring discharge measured over time can be considered a continuous signal, and the ability to predict future signals of spring discharge is useful for monitoring flow and other issues, such as contaminant influence, related to the spring.
The gamma distribution has commonly been used to simulate rainfall. We instead propose a bootstrapping method to simulate rainfall, which allows new samples to be generated adequately over different periods of time, as well as for specific rain events such as hurricanes or droughts. Both non-windowed and windowed approaches to bootstrapping the recharge are considered, along with the resulting effects on the prediction-band coverage for the spring discharge. This non-parametric approach to the input rainfall augurs well for the non-parametric nature of the output signal.
| In addition to the above, the question arises as to whether the discharge is dependent
on the pathway navigated by the flow. These pathways are referred to as "trees" and are of great interest because identifying significant differences between trees leads to a classification for them, which could aid in building a model that fits any given input recharge data. A Hotelling's T2 test assumes multivariate normality. Since we cannot make that assumption in this instance, a non-parametric approach with less rigorous assumptions is desired. A classification test via the k-means clustering process is utilized to distinguish between the pathways taken by the flow of the discharge in the spring.
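The k-means clustering step can be sketched as plain Lloyd's iteration, alternating nearest-centroid assignment with centroid updates (an illustrative sketch, not the dissertation's code):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain Lloyd's k-means: alternate assigning points to the nearest
    centroid and recomputing centroids until assignments stabilize."""
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])  # keep empty clusters
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```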
|
| Back To Top |
| March 28, 2012 |
| Rachel Becvarik , FSU Dept. of Statistics |
| An Alternative Upper Control Limit to the Average Run Length to Balance Power and False Alarms |
| March 28, 2012 9:00 am |
| OSB 215 |
|
| It has been shown that likelihood ratio tests successfully monitor for changes in profiles involving high-dimensional nonlinear data. These methods use a traditional flat-line upper control limit (UCL) based on the average run length (ARL). The current methods take into consideration neither the error or power associated with the test nor the underlying distribution of the ARL. Additionally, if the statistic is known to be increasing over time, a flat UCL does not adapt to the increase. This paper focuses on a method to find the most powerful UCL for an increasing statistic at a specified type I error. |
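One simple way to obtain a time-varying UCL for an increasing statistic, in the spirit described above, is to simulate in-control runs and take pointwise quantiles at a level adjusted for the run length (a hedged sketch using a Bonferroni-style adjustment, not the paper's exact construction):

```python
import numpy as np

def time_varying_ucl(simulate_stat, n_periods, alpha=0.05, n_sim=2000, rng=None):
    """Pointwise upper control limit for a (possibly increasing) monitoring
    statistic: simulate in-control runs and take per-period quantiles at a
    Bonferroni-adjusted level so the run-wise type I error is about alpha."""
    rng = np.random.default_rng(rng)
    runs = np.array([simulate_stat(n_periods, rng) for _ in range(n_sim)])
    return np.quantile(runs, 1 - alpha / n_periods, axis=0)
```

Unlike a flat ARL-based limit, the resulting UCL rises along with the statistic's in-control trajectory.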
| Back To Top |
| March 27, 2012 |
| Jihyung Shin, FSU Dept. of Statistics |
| Mixed-effects and mixed-distribution models for count data with applications to educational research data. |
| March 27, 2012 3:30 pm |
| OSB 215 |
|
| This research is motivated by an analysis of reading research data. We are interested in modeling a test outcome, the ability to fluently recode letters into sounds, for kindergarten children aged between 5 and 7. The data showed excessive zero scores (more than 30% of children) on the test. In this dissertation, we carefully examine models for excessive zeros, which are based on a mixture of two distributions: a point mass at zero and a standard probability distribution with non-negative values. In such cases, a lognormal variable (for semicontinuous data) or a Poisson random variable (for count data) is observed with some probability. We review the previously proposed models: the mixed-effects and mixed-distribution models of Tooze et al. (2002) for semicontinuous data and the zero-inflated Poisson regression models of Lambert (1992) for count data. We then apply zero-inflated Poisson models to repeated-measures zero-inflated count data by introducing a pair of possibly correlated random effects into the zero-inflated Poisson model to accommodate within-subject correlation and between-subject heterogeneity. The likelihood function, approximated by adaptive Gaussian quadrature, is maximized using dual quasi-Newton optimization in a standard statistical software package. Simulation studies and application results are also presented. |
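For reference, the zero-inflated Poisson likelihood at the heart of such models has a simple closed form; a minimal Python sketch of the pmf and log-likelihood (without the correlated random effects, which are the dissertation's contribution):

```python
from math import exp, log, lgamma

def zip_logpmf(y, pi, lam):
    """Log pmf of a zero-inflated Poisson: with probability pi the count is
    a structural zero; otherwise it is Poisson(lam)."""
    if y == 0:
        return log(pi + (1 - pi) * exp(-lam))          # structural or Poisson zero
    return log(1 - pi) - lam + y * log(lam) - lgamma(y + 1)

def zip_loglik(ys, pi, lam):
    """Log-likelihood of an i.i.d. sample of counts."""
    return sum(zip_logpmf(y, pi, lam) for y in ys)
```

The mixed-effects version replaces (pi, lam) with subject-specific values linked to correlated random effects, and the resulting integral is what adaptive Gaussian quadrature approximates.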
| Back To Top |
| March 26, 2012 |
| Jianchang Lin |
| Semiparametric Bayesian survival analysis using models with log-linear median |
| March 26, 2012 1:00 pm |
| 215 OSB |
|
| First, we present two novel semiparametric survival models with log-linear median regression
functions for right censored survival data. These models are useful alternatives to the
popular Cox (1972) model and linear transformation models (Cheng et al., 1995). Compared
to existing semiparametric models, our models have many important practical advantages,
including interpretation of the regression parameters via the median and the ability to
address heteroscedasticity. We demonstrate that our modeling techniques facilitate the
ease of prior elicitation and computation for both parametric and semiparametric Bayesian
analysis of survival data. We illustrate the advantages of our modeling, as well as model
diagnostics, via reanalysis of a small-cell lung cancer study. Results of our simulation study
provide further guidance regarding appropriate modelling in practice.
Our second goal is to develop the methods of analysis and associated theoretical properties
for interval censored and current status survival data. These new regression models
use log-linear regression function for the median. We present frequentist and Bayesian procedures
for estimation of the regression parameters. Our model is a useful and practical
alternative to the popular semiparametric models which focus on modeling the hazard function.
We illustrate the advantages and properties of our proposed methods via reanalyzing
a breast cancer study.
Our other aim is to develop a model which is able to account for the heteroscedasticity
of response, together with robust parameter estimation and outlier detection using sparsity
penalization. Some preliminary simulation studies have been conducted to compare the
performance of proposed model and existing median lasso regression model. Considering
the estimation bias, mean squared error and other identification benchmark measures, our
proposed model performs better than the competing frequentist estimator. |
| Back To Top |
| March 23, 2012 |
| Bob Clickner, FSU Dept. of Statistics |
| Statistical Investigation of the Relationship between Fish Consumption and Mercury in Blood |
| March 23, 2012 10:00 am |
| OSB 110 |
|
| Fish and shellfish are an important and healthy source of many nutrients, including protein, vitamins, omega-3 fatty acids and others. However, humans are also exposed to methylmercury (MeHg) through the consumption of finfish and shellfish. Mercury released into the environment is converted to MeHg in soils and sediments and bioaccumulates through aquatic food webs. This bioaccumulation leads to increased levels of MeHg in large, predatory fish. MeHg exposure in utero is associated with adverse health effects, e.g., neuropsychological deficits such as IQ and motor function deficits, in children. Over a period of several years, we studied exposure to MeHg via fish and shellfish consumption through a series of statistical analyses of data on fish tissue mercury concentrations and 1999-2008 NHANES blood mercury concentrations and fish consumption data in women of reproductive age (16-49 years). The objective was to investigate the strength and level of the association and patterns in fish consumption and mercury exposure, including demographic, socio-economic, geographic, and temporal trends. Blood MeHg was calculated from the blood total and inorganic concentrations after imputing below-detection-limit concentrations. NHANES dietary datasets were combined to estimate 30-day finfish/shellfish consumption. Fish tissue mercury concentrations were combined with the NHANES data to estimate 30-day mercury intake per gram of body weight. Linear and logistic regression analyses were used to evaluate associations and trends, adjusting for demographic characteristics. |
| Back To Top |
| March 16, 2012 |
| Wei Wu, FSU Dept. of Statistics |
| Consistency Theory for Signal Estimation under Random Time-Warping |
| March 16, 2012 10:00 am |
| OSB 110 |
|
| Function registration/alignment is one of the central problems in Functional Data Analysis and has been extensively investigated over the past two decades. Using a generative model, this problem can also be studied as one of estimating a signal observed under random time-warping. An important requirement here is that the estimator be consistent, i.e., that it converge to the underlying deterministic function as the observation size goes to infinity. This has not been accomplished by previous methods in general terms. We have recently introduced a novel framework for estimating the unknown signal under random warpings, and have shown that it improves on state-of-the-art performance in function registration/alignment. Here we demonstrate that the proposed algorithm leads to a consistent estimator of the underlying signal. This estimation is also illustrated with convincing examples. Furthermore, we extend our method to estimation of multi-dimensional signals, providing rigorous proofs and illustrative examples. This is joint work with Anuj Srivastava. |
| Back To Top |
| March 2, 2012 |
| Piyush Kumar, FSU Dept. of Computer Science |
| Instant approximate 1-center on roads |
| March 2, 2012 10:00 am |
| OSB 110 |
|
| Computing the mean, center or median is one of the fundamental tasks
in many applications. In this talk, I will present an algorithm to
compute 1-center solutions on road networks, an important problem in
GIS. Using Euclidean embeddings, and reduction to fast nearest
neighbor search, we devise an approximation algorithm for this
problem. Our initial experiments on real world data sets indicate fast
computation of constant factor approximate solutions for query sets
much larger than previously computable using exact techniques. Our
techniques extend to k-clustering problems as well. I will end with
some interesting open problems we are working on.
This is joint work with my students: Samidh Chatterjee, James McClain
and Bradley Neff. |
| Back To Top |
| March 1, 2012 |
| Jun Li, Dept. of Statistics, Stanford University |
| Differential Expression Identification and False Discovery Rate Estimation in RNA-Seq Data |
| March 1, 2012 11:00 am |
| OSB 215 |
|
| RNA-Sequencing (RNA-Seq) is taking the place of microarrays and becoming the primary tool for measuring genome-wide transcript expression. We discuss the identification of features (genes, isoforms, exons, etc.) that are associated with an outcome in RNA-Seq and other sequencing-based comparative genomic experiments. That is, we aim to find features that are differentially expressed in samples in different biological conditions or under different disease statuses. RNA-Seq data take the form of counts, so models based on the normal distribution are generally unsuitable. The problem is especially challenging because different sequencing experiments may generate quite different total numbers of reads, or “sequencing depths”. Existing methods for this problem are based on Poisson or negative-binomial models: they are useful but can be heavily influenced by “outliers” in the data. We introduce a simple, non-parametric method with resampling to account for the different sequencing depths. The new method is more robust than parametric methods. It can be applied to data with quantitative, survival, two-class, or multiple-class outcomes. We compare our proposed method to Poisson- and negative-binomial-based methods in simulated and real data sets, and find that our method discovers more consistent patterns than competing methods. |
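Sequencing-depth differences can be handled by resampling each sample's counts down to a common depth; a minimal binomial-thinning sketch (one plausible resampling scheme, not necessarily the one used in the talk):

```python
import numpy as np

def downsample_counts(counts, target_depth, rng=None):
    """Resample a vector of per-feature read counts down to a common
    sequencing depth by binomial thinning, making samples comparable."""
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts)
    depth = counts.sum()
    if depth <= target_depth:
        return counts.copy()           # already at or below the target depth
    return rng.binomial(counts, target_depth / depth)
```

Repeating the thinning gives resampled datasets at equal depth on which a non-parametric test statistic can be computed directly.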
| Back To Top |
| February 29, 2012 |
| Cun-Hui Zhang, Rutgers University Dept. of Statistics |
| Statistical Inference with High-Dimensional Data |
| February 29, 2012 3:30 pm |
| OSB 108 |
|
| We propose a semi low-dimensional (LD) approach for statistical analysis of certain types
of high-dimensional (HD) data. The proposed approach is best described with the following
model statement:
model = LD component + HD component.
The main objective of this semi-LD approach is to develop statistical inference procedures for the LD component, including p-values and confidence regions. This semi-LD approach is very much inspired by the semiparametric approach in which a statistical model is decomposed as follows:
model = parametric component + nonparametric component.
Just as in the semiparametric approach, the worst LD submodel gives the minimum Fisher information for the LD component, along with an efficient score function. The efficient score function, or an estimate of it, can be used to derive an efficient estimator for the LD component. The efficient estimator is asymptotically normal with the inverse of the minimum Fisher information as its asymptotic covariance matrix. This asymptotic covariance matrix may be consistently estimated in a natural way. Consequently, approximate confidence intervals and p-values can be constructed.
|
| Back To Top |
| February 29, 2012 |
| Daniel Osborne, Ph.D candidate, FSU Dept. of Statistics |
| Nonparametric Data Analysis on Manifolds with Applications in Medical Imaging |
| February 29, 2012 10:30 am |
| Montgomery Gym (Mon) Rm 102 |
|
| Over the past twenty years, there has been a rapid development in Nonparametric Statistical Analysis on Manifolds applied to Medical Imaging problems. In this body of work, we focus on two different medical imaging problems. The first problem corresponds to analyzing the CT scan data. In this context, we perform nonparametric analysis on the 3D data retrieved from CT scans of healthy young adults, on the Size-and-Reflection Shape Space SR?_3,0^k of k-ads in general position in 3D. This work is a part of larger project on planning reconstructive surgery in severe skull injuries which includes preprocessing and post-processing steps of CT images. The next problem corresponds to analyzing MR diffusion tensor imaging data. Here, we develop a two-sample procedure for testing the equality of the generalized Frobenius means of two independent populations on the space of symmetric positive matrices. These new methods, naturally lead to an analysis based on Cholesky decompositions of covariance matrices which helps to decrease computational time and does not increase dimensionality. The resulting nonparametric matrix valued statistics are used for testing if there is a difference on average between corresponding signals in Diffusion Tensor Images (DTI) in young children with dyslexia when compared to their clinically normal peers. The results presented here correspond to data that was previously used in the literature using parametric methods which also showed a significant difference. |
| Back To Top |
| February 28, 2012 |
| Eric Lock, Dept of Statistics, University of North Carolina at Chapel Hill |
| Joint and Individual Variation Explained (JIVE) for Integrated Analysis of Multiple Datatypes. |
| February 28, 2012 3:30 pm |
| OSB 110 |
|
| Research in a number of fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. We introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across datatypes, low-rank approximations for structured variation individual to each datatype, and residual noise. JIVE quantifies the amount of joint variation between datatypes, reduces the dimensionality of the data in an insightful way, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. We describe a JIVE analysis of gene expression and microRNA data for cancerous tumor samples, and discuss additional applications. This is joint work with Andrew Nobel, J.S. Marron and Katherine Hoadley. |
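The JIVE idea can be sketched as an alternating pair of truncated SVDs: a joint low-rank fit on the row-stacked datatypes and an individual low-rank fit on each datatype's residual (a simplified illustration; the published algorithm adds orthogonality constraints and rank selection, and the names below are hypothetical):

```python
import numpy as np

def low_rank(X, r):
    """Best rank-r approximation of X via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def jive_sketch(blocks, r_joint, r_indiv, n_iter=20):
    """Simplified JIVE-style decomposition: blocks are (features_i x samples)
    matrices sharing the same sample columns. Returns the joint low-rank
    term J (stacked) and the individual terms A (one per block)."""
    X = np.vstack(blocks)
    sizes = [b.shape[0] for b in blocks]
    J = np.zeros_like(X)
    A = [np.zeros_like(b) for b in blocks]
    for _ in range(n_iter):
        J = low_rank(X - np.vstack(A), r_joint)             # joint variation
        Jb = np.split(J, np.cumsum(sizes)[:-1], axis=0)
        A = [low_rank(b - j, r)                             # individual variation
             for b, j, r in zip(blocks, Jb, r_indiv)]
    return J, A
```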
| Back To Top |
| February 27, 2012 |
| Kelly McGinnity, FSU Dept. of Statistics |
| Nonparametric Cross-Validated Wavelet Thresholding for Non-Gaussian Errors |
| February 27, 2012 11:00 am |
| OSB 215 |
|
| Wavelet thresholding generally assumes independent, identically distributed Gaussian
errors when estimating functions in a nonparametric regression setting. VisuShrink
and SureShrink are just two of the many common thresholding methods based on this
assumption. When the errors are not normally distributed, however, few methods have
been proposed. In this paper, a distribution-free method for thresholding wavelet coefficients
in nonparametric regression is described. Unlike some other non-normal error
thresholding methods, the proposed method does not assume the form of the nonnormal
distribution is known. A simulation study shows the efficiency of the proposed
method on a variety of non-Gaussian errors, including comparisons to existing wavelet
threshold estimators. |
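For context, the standard soft-thresholding step and the VisuShrink universal threshold, against which distribution-free alternatives are compared, look like this (an illustrative sketch of the textbook baseline, not the proposed method):

```python
import numpy as np

def soft_threshold(coeffs, t):
    """Soft-threshold wavelet detail coefficients: shrink toward zero by t."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

def universal_threshold(coeffs):
    """VisuShrink's universal threshold sigma*sqrt(2*log(n)), with sigma
    estimated by the median absolute deviation of fine-scale coefficients
    (valid under the usual i.i.d. Gaussian noise assumption)."""
    sigma = np.median(np.abs(coeffs)) / 0.6745
    return sigma * np.sqrt(2 * np.log(len(coeffs)))
```

The Gaussian assumption enters through the sqrt(2 log n) tail bound; a distribution-free method must pick t without it, e.g. by cross-validation.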
| Back To Top |
| February 16, 2012 |
| Alec Kercheval, FSU Dept. of Mathematics |
| A generalized birth-death stochastic model for high-frequency order book dynamics in the electronic stock market |
| February 16, 2012 2:00 pm |
| DSL 499 |
|
| The limit order book is an electronic clearing house for limit and market orders operated by the major stock exchanges. Computer driven traders interact with the exchange using this order book on the millisecond time scale. Traders and regulators are interested in understanding the dynamics of this object as it can affect the economy as a whole, now that more than 50% of all trading volume on the NYSE is from automated trades.
In this talk we look at the structure of the limit order book and discuss ways to model the evolution of prices in order to
compute probabilities of interest to traders.
|
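As a toy version of the kind of probability computation the talk mentions, the following Monte Carlo sketch (a plain birth-death queue; all parameter names are illustrative, not the speaker's model) estimates the chance that the queue at the best quote empties, and hence the price moves, before a time horizon:

```python
import random

def prob_queue_depletes(q0, lam, mu, horizon, n_sims=2000, seed=1):
    # Birth-death queue: limit orders arrive at rate lam; each resting
    # order leaves (cancellation or trade) at rate mu.  Estimates
    # P(queue hits 0 before `horizon`), i.e. the best quote moves.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        q, t = q0, 0.0
        while q > 0:
            rate = lam + mu * q          # total event rate
            t += rng.expovariate(rate)   # time to next event
            if t > horizon:
                break
            if rng.random() < lam / rate:
                q += 1                   # birth: new limit order
            else:
                q -= 1                   # death: cancel or execution
        if q == 0:
            hits += 1
    return hits / n_sims
```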
| February 10, 2012 |
| Jennifer Geis, Ph.D. candidate, FSU Dept. of Statistics |
| Adaptive Canonical Correlation Analysis through a Weighted Rank Selection Criterion: Inferential Methods for Multivariate Response Models with Applications to a HIV/Neurocognitive Study |
| February 10, 2012 3:30 pm |
| OSB 108 |
|
|
Multivariate response models are used increasingly in almost all fields, employing inferential methods such as Canonical Correlation Analysis (CCA). This requires estimating the number of canonical relationships or, equivalently, determining the rank of the coefficient estimator, which may be done using the Rank Selection Criterion (RSC) of Bunea et al. under an i.i.d. assumption on the error terms. While that assumption is necessary for their strong theoretical results, some flexibility is required in practical application. Developed here is large-sample theory that parallels their work and supports the addition of a "decorrelator" weight matrix. One such possibility in the large-sample setting is the sample residual covariance; a computationally more convenient weight matrix, however, is the sample response covariance. With this weight matrix, CCA is directly accessible through the weighted version of RSC, giving an Adaptive CCA (ACCA).
However, particular considerations are required in the high-dimensional setting, where similar theory no longer holds. Offered instead are extensive simulations showing that the sample response covariance still provides good rank recovery and estimation of the coefficient matrix, and hence good estimation of the number of canonical relationships and variates. It will be argued precisely why other versions of the residual covariance, including a regularized version, are poor choices in the high-dimensional setting. Another way to avoid these issues is to apply some variable selection methodology before using ACCA for inferential conclusions. Indeed, any group selection method may be applied prior to ACCA, since variable selection in the multivariate response model is the same as group selection in the univariate response model, which eliminates these concerns entirely.
To offer a practical application of these ideas, ACCA will be applied to a neuroimaging dataset. A high-dimensional dataset will be generated from this large sample, to which Group LASSO will first be applied before ACCA. A unique perspective may then be offered on the relationships between cognitive deficiencies in HIV-positive patients and the extensive neuroimaging measures available.
|
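The rank-selection idea behind RSC can be sketched in a few lines: fit by least squares, truncate the fitted values to rank r, and minimize residual error plus a penalty proportional to r. This is an illustrative unweighted version (no decorrelator matrix, hypothetical function names), not the estimator developed in the talk:

```python
import numpy as np

def reduced_rank_fit(X, Y, r):
    # OLS fit, then the best rank-r truncation of the fitted values.
    B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
    F = X @ B_ols
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r, :]

def select_rank(X, Y, penalty):
    # RSC-style choice: minimize ||Y - F_r||_F^2 + penalty * r
    # over candidate ranks r.
    m = Y.shape[1]
    p = np.linalg.matrix_rank(X)
    best_r, best_val = 0, np.inf
    for r in range(0, min(p, m) + 1):
        F_r = reduced_rank_fit(X, Y, r)
        val = np.sum((Y - F_r) ** 2) + penalty * r
        if val < best_val:
            best_r, best_val = r, val
    return best_r
```

With a strong rank-2 signal and small noise, the criterion recovers the true rank, which in the CCA setting corresponds to the number of canonical relationships.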
| February 10, 2012 |
| Debdeep Pati |
| Nonparametric Bayes learning of low dimensional structure in big objects |
| February 10, 2012 10:00 am |
| OSB 110 |
|
| The first part of the talk will focus on Bayesian nonparametric models for learning low-dimensional structure underlying higher dimensional objects with special emphasis on models for 2D and 3D shapes where the data typically consists of points embedded in 2D pixelated images or a cloud of points in $\mathbb{R}^3$. Models for distributions of shapes can be widely used in biomedical applications ranging from tumor tracking for targeted radiation therapy to classifying cells in a blood sample. We propose tensor product-based Bayesian probability models for 2D closed curves and 3D closed surfaces. We initially consider models for a single surface using a cyclic basis and array shrinkage priors. The model avoids parameter constraints, leads to highly efficient posterior computation, and has strong theoretical properties including near minimax optimal rates. Focusing on the 2D case, we also develop a multiscale deformation model for joint alignment and analysis of related shapes motivated by data on images containing many related objects. Efficient and scalable algorithms are developed for posterior computation, and the models are applied to 3D surface estimation data from the literature and 2D imaging data on cell shapes. In developing general purpose models for potentially high-dimensional objects and surfaces, it is important to consider theoretical properties. In the final part of the talk, we give an overview of our recent theoretical results on large support, consistency and minimax optimal rates in Bayesian models for regression surfaces and density regression. |
| February 3, 2012 |
| Zhihua Sophia Su |
| Envelope Models and Methods |
| February 3, 2012 10:00 am |
| OSB 110 |
|
| This talk presents a new statistical concept called an envelope. An envelope has the potential to achieve substantial efficiency gains in multivariate analysis by identifying and cleaning up immaterial information in the data. The efficiency gains will be demonstrated both by theory and example. Some recent developments in this area, including partial envelopes and inner envelopes, will also be discussed. They refine and extend the enveloping idea, adapting it to more data types and increasing the potential to achieve efficiency gains. Applications of envelopes and their connection to other fields will also be mentioned. |
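For readers new to the construct, one standard way to write the response-envelope model (stated here from the general literature on envelopes, not taken from the talk itself) is:

```latex
% Multivariate linear model: response Y in R^r, predictor X in R^p
Y = \alpha + \beta X + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma).
% The envelope is the smallest reducing subspace of \Sigma containing
% span(\beta). With \Gamma a basis for it and (\Gamma, \Gamma_0) an
% orthogonal matrix, the model becomes
Y = \alpha + \Gamma \eta X + \varepsilon, \qquad
\Sigma = \Gamma \Omega \Gamma^{\top} + \Gamma_0 \Omega_0 \Gamma_0^{\top},
% so that \Gamma_0^{\top} Y is immaterial to X and can be discarded,
% which is the source of the efficiency gains.
```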
| January 27, 2012 |
| Harry Crane |
| Partition-valued Processes and Applications to Phylogenetic Inference |
| January 27, 2012 10:00 am |
| OSB 110 |
|
| In this talk, we present the cut-and-paste process, a novel infinitely exchangeable process on the state space of partitions of the natural numbers whose sample paths differ from previously studied exchangeable coalescent (Kingman 1982; Pitman 1999) and fragmentation (Bertoin 2001) processes. We discuss some mathematical properties of this process, as well as a two-parameter subfamily which has a matrix as one of its parameters. This matrix can be interpreted as a similarity matrix for pairwise relationships and has a natural application to inference of the phylogenetic tree of a group of species for which we have mitochondrial DNA data. We compare the results of this inference to those of some other methods, and discuss computational issues that arise as well as natural extensions of this model to Bayesian inference, hidden Markov models and tree-valued Markov processes.
We also discuss how this process and its extensions fit into the more general framework of statistical modeling of structure and dependence via combinatorial stochastic processes, e.g.\ random partitions, trees and networks, and the practical importance of infinite exchangeability in this context. |
| January 20, 2012 |
| Anindra Bhadra |
| Simulation-based maximum likelihood inference for partially observed Markov process models |
| January 20, 2012 10:00 am |
| OSB 110 |
|
| Estimation of static (or time constant) parameters in a general
class of nonlinear, non-Gaussian, partially observed Markov process models
is an active area of research. In recent years, simulation-based
techniques have made estimation and inference feasible for these models
and have offered great flexibility to the modeler. An advantageous feature
of many of these techniques is that there is no requirement to evaluate
the state transition density of the model, which is often high-dimensional
and unavailable in closed form. Instead, inference can proceed as long as
one is able to simulate from the state transition density, often a much
simpler problem. In this talk, we introduce a simulation-based maximum
likelihood inference technique known as iterated filtering that uses an
underlying sequential Monte Carlo (SMC) filter. We discuss some key
theoretical properties of iterated filtering. In particular, we prove the
convergence of the method and establish connections between iterated
filtering and well-known stochastic approximation methods. We then use the
iterated filtering technique to estimate parameters in a nonlinear,
non-Gaussian mechanistic model of malaria transmission and answer
scientific questions regarding the effect of climate factors on malaria
epidemics in Northwest India. Motivated by the challenges encountered in
modeling the malaria data, we conclude by proposing an improvement
technique for SMC filters used in an off-line, iterative setting. |
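The plug-and-play property described above, needing only to simulate from the state transition, is exactly what a bootstrap SMC filter exploits. Below is a minimal sketch for a hypothetical linear-Gaussian toy model; this is the generic filter underlying the approach, not the iterated-filtering procedure itself:

```python
import numpy as np

def bootstrap_filter(y, phi, sigma_x, sigma_y, n_particles=500, seed=0):
    # Bootstrap SMC filter for x_t = phi * x_{t-1} + N(0, sigma_x^2),
    # y_t = x_t + N(0, sigma_y^2).  Only *simulation* from the state
    # transition is required -- no transition density is evaluated.
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_x, n_particles)   # crude initial cloud
    loglik = 0.0
    for yt in y:
        x = phi * x + rng.normal(0.0, sigma_x, n_particles)  # propagate
        # weight by the (known) measurement density
        w = np.exp(-0.5 * ((yt - x) / sigma_y) ** 2) / (sigma_y * np.sqrt(2 * np.pi))
        loglik += np.log(w.mean())              # log-likelihood increment
        x = rng.choice(x, size=n_particles, p=w / w.sum())   # resample
    return loglik
```

Iterated filtering would wrap such a filter in an outer loop that perturbs the static parameters and uses the filter output to drive a stochastic-approximation update toward the maximum likelihood estimate.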
| January 13, 2012 |
| Xinge Jessie Jeng |
| Optimal Sparse Signal Identification with Applications in Copy Number Variation Analysis |
| January 13, 2012 10:00 am |
| OSB 110 |
|
| DNA copy number variation (CNV) plays an important role in population diversity and complex diseases. Motivated by CNV analysis based on high-density single nucleotide polymorphism (SNP) data, we consider two problems arising from the need to identify sparse and short CNV segments in long sequences of genome-wide data. The first problem is to identify the CNVs utilizing a single sample. An efficient likelihood ratio selection (LRS) procedure is developed, and its asymptotic optimality is presented for identifying short and sparse CNVs. The second problem aims to identify recurrent CNVs based on a large number of samples from a population. We propose a proportion adaptive segment selection (PASS) procedure that automatically and optimally adjusts to the unknown proportions of CNV carriers.
In these problems, we introduce an innovative statistical framework for developing optimal procedures for CNV analysis. We study fundamental properties of signal identification by characterizing the detectable and undetectable regions. Only in the detectable region is it possible to consistently separate the CNV signals from noise. Such demarcations provide deep insights into methods development and serve as benchmarks for evaluating methods. We prove that LRS and PASS are consistent in the interiors of their respective detectable regions, thus implying the asymptotic optimality of the proposed methods.
The proposed methods are demonstrated with simulations and with analyses of a family trio dataset and a neuroblastoma dataset. The results show that the LRS procedure yields greater power for detecting short CNVs than some popular CNV identification procedures, and that PASS significantly improves the power for CNV detection by pooling information across samples, efficiently identifying both rare and common CNVs carried by neuroblastoma patients. |
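The single-sample scan underlying an LRS-type procedure can be illustrated as follows: under unit-variance Gaussian noise, the likelihood-ratio statistic for a mean shift on an interval is the standardized interval sum, and short intervals exceeding a threshold are reported. This sketch (greedy overlap pruning; all names illustrative) is not the paper's implementation:

```python
import numpy as np

def scan_segments(y, max_len, threshold):
    # Scan all intervals up to max_len; the statistic for interval
    # [i, j) is (sum of y over the interval) / sqrt(j - i), the Gaussian
    # likelihood-ratio statistic for a mean shift with unit noise variance.
    n = len(y)
    cs = np.concatenate([[0.0], np.cumsum(y)])   # prefix sums
    hits = []
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            stat = (cs[j] - cs[i]) / np.sqrt(j - i)
            if abs(stat) > threshold:
                hits.append((i, j, stat))
    # keep the strongest statistic among overlapping candidates (greedy)
    hits.sort(key=lambda h: -abs(h[2]))
    selected = []
    for i, j, s in hits:
        if all(j <= a or i >= b for a, b, _ in selected):
            selected.append((i, j, s))
    return selected
```

On a clean signal with one elevated segment, the scan returns exactly that segment.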
| January 10, 2012 |
| Ingram Olkin |
| Inequalities: Theory of Majorization and Its Applications |
| January 10, 2012 3:30 pm |
| OSB 110 |
|
| There are many theories of "equations": linear equations, differential equations, functional equations, and more. However, there is no central theory of "inequations." There are, though, several general themes that lead to many inequalities. One such theme is convexity. Another is majorization, which is a particular partial order. What is important in this context is that the partial order have lots of examples, and that the order-preserving functions be a rich class. Majorization meets both criteria, arising in many fields of mathematics (geometry, numerical analysis, graph theory) and beyond (physics, chemistry, political science, economics). In this talk we describe the origins of majorization, many examples of majorization, and its consequences. |