DNA Base-Calling Project



Current Sanger sequencing combines enzymatic reactions, electrophoresis, and fluorescence-based detection. This diagram illustrates the basic idea of the scheme. The procedure produces a four-component vector time series: the fluorescence intensities.


Base-calling is a central part of any large-scale genomic sequencing effort. It takes the above vector time series (here are two segments) and produces an estimate of the underlying DNA sequence which gave rise to that signal.
My dissertation approaches the problem of base-calling from a statistical perspective, aiming to use a statistical model to call bases and to attach suitable measures of uncertainty to the bases we call. The research is conducted under my advisor, Terry Speed. Other people in this group are Dave Nelson and Simon Cawley.


Big Picture


The following flow chart depicts the structure of our model and base-calling strategy.


With this statistical model we are trying to mimic the DNA sequencing procedure itself. Namely, given a DNA sequence, we can generate from the model a virtual sequencing trace that closely resembles real data. Here is one example. More examples of simulated base-calling can be found on Simon's homepage.


My current research on DNA base-calling mainly concerns three topics: hidden Markov models, deconvolution, and color separation. Five manuscripts have arisen from this research, and my dissertation will be a collection of these manuscripts. More details are given by the following titles and summaries.


Hidden Markov Models


A Hidden Markov Model of DNA sequencing.

In this paper, we describe a new base-calling algorithm for use with automatic fluorescence-based DNA sequencing instruments. Its main steps consist of color separation, deconvolution, and the use of a hidden Markov model (HMM) decoder. The core of the algorithm is the HMM, which has a finite number of hidden states and mixtures of Gaussians for outputs. This model is designed to take into account variations in the amplitude and spacing of the peaks in the sequencing trace, at the same time as permitting the use of efficient algorithms for training and the reconstruction of the hidden states. Since this reconstruction is probabilistic in nature, a useful by-product is the probabilistic assessment of alternative base calls.
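
As a rough illustration of how an HMM decoder can attach probabilities to base calls, here is a minimal forward-backward sketch. It is not the model from the paper: it assumes one hidden state per base, a single Gaussian per channel instead of a mixture, uniform transitions by default, and one 4-channel peak-height vector per base position. All names and parameter values (`emission`, `posterior_base_calls`, `on_mean`, `sd`) are hypothetical.

```python
import math

BASES = "ACGT"

def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def emission(obs, state, on_mean=1.0, off_mean=0.0, sd=0.3):
    """Likelihood of a 4-channel observation given the hidden base index.

    Assumption: channels are independent, and the channel matching the
    base has mean on_mean while the others have mean off_mean.
    """
    p = 1.0
    for ch, x in enumerate(obs):
        p *= gaussian_pdf(x, on_mean if ch == state else off_mean, sd)
    return p

def posterior_base_calls(observations, trans=None, init=None):
    """Return (base, posterior probability) for each observed peak."""
    n = len(BASES)
    trans = trans or [[1.0 / n] * n for _ in range(n)]
    init = init or [1.0 / n] * n
    T = len(observations)
    b = [[emission(o, s) for s in range(n)] for o in observations]

    # Forward pass, rescaled at each step to avoid underflow.
    alpha = []
    for t in range(T):
        if t == 0:
            row = [init[s] * b[0][s] for s in range(n)]
        else:
            prev = alpha[-1]
            row = [b[t][s] * sum(prev[r] * trans[r][s] for r in range(n))
                   for s in range(n)]
        total = sum(row)
        alpha.append([v / total for v in row])

    # Backward pass, also rescaled.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        row = [sum(trans[s][r] * b[t + 1][r] * beta[t + 1][r] for r in range(n))
               for s in range(n)]
        total = sum(row)
        beta[t] = [v / total for v in row]

    # Posterior state probabilities and the most probable base per position.
    calls = []
    for t in range(T):
        gamma = [alpha[t][s] * beta[t][s] for s in range(n)]
        total = sum(gamma)
        gamma = [v / total for v in gamma]
        best = max(range(n), key=lambda s: gamma[s])
        calls.append((BASES[best], gamma[best]))
    return calls
```

A clean peak should be called with a posterior near 1, while a peak split evenly between two channels should get a posterior near 0.5, which is the kind of uncertainty measure the abstract refers to.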


You can find the topology graph of our hidden Markov model here.



Deconvolution

A parametric deconvolution method.

This paper describes a deconvolution method appropriate for a particular class of signals, namely what we term spike-deconvolution models. These models arise when a sparse spike train is convolved with a fixed point-spread function, and additive noise or measurement error is superimposed. In this context we view deconvolution as an estimation problem, treating the positions and amplitudes of the underlying spikes, as well as background and noise parameters, as unknowns. Our estimation scheme starts with a set of trigonometric moment estimates of positions and amplitudes, typically more than will ultimately be recognized, and then obtains the final parameters using a combination of backwards deletion and the BIC model selection criterion. We also present results on the spectral structure of Toeplitz matrices which play a role in the estimation. Finally, we present simulations to show how our estimates perform.
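
The spike-deconvolution model described above can be sketched as a small simulator: a sparse set of spikes is convolved with a fixed point-spread function, then a constant background and Gaussian noise are added. This shows only the forward model, not the estimation scheme. The Gaussian shape of the point-spread function and all names and defaults (`gaussian_psf`, `simulate_trace`, `width`, `seed`) are assumptions made for illustration.

```python
import math
import random

def gaussian_psf(t, width=2.0):
    """Assumed point-spread function: an unnormalized Gaussian bump."""
    return math.exp(-0.5 * (t / width) ** 2)

def simulate_trace(positions, amplitudes, n_samples,
                   background=0.0, noise_sd=0.0, width=2.0, seed=0):
    """Simulate y(t) = background + sum_j a_j * h(t - tau_j) + noise."""
    rng = random.Random(seed)
    trace = []
    for t in range(n_samples):
        signal = sum(a * gaussian_psf(t - tau, width)
                     for tau, a in zip(positions, amplitudes))
        trace.append(background + signal + rng.gauss(0.0, noise_sd))
    return trace

# Two spikes at samples 10 and 30 with amplitudes 1.0 and 0.5.
trace = simulate_trace([10, 30], [1.0, 0.5], n_samples=40, background=0.1)
```

The unknowns an estimator would have to recover from `trace` are exactly the arguments here: the spike positions and amplitudes, the background level, and the noise scale.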


Comparison of some deconvolution techniques.

In this paper we compare several deconvolution techniques, including those minimizing a convex functional subject to nonnegativity constraints on the unknown function, the maximum entropy deconvolution, the technique introduced by Jansson, and a parametric approach we have recently described. Our focus is on the ease with which an algorithm can be implemented automatically, and on aspects of the reconstructed function, such as having the correct number of peaks. We describe the methods, make some general comments about their implementation, and compare them on a segment from an electrophoretic trace obtained from a DNA sequencing instrument.

Here are some results.


Color Separation

Estimation of the color separation matrix in four-dye fluorescence-based DNA sequencing.

Color separation is an essential step in the analysis of intensity data collected on automated DNA sequencing instruments using a four-dye strategy. We begin the paper by describing a graphical approach to displaying the phenomenon of cross-talk. The display suggests a natural estimate of the color separation matrix, and we present an iterative algorithm which produces such an estimate. Each loop of the algorithm has two steps: a first, which samples typical points from the data, and a second, a regression step, which produces estimates of the parameters. The algorithm is illustrated on a sequencing trace obtained from a slab gel constructed at the Human Genome Center at LBNL.

If you are interested, look at these graphical displays: data before color separation, and data after color separation.
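
To make the role of the color separation matrix concrete, here is a minimal sketch of applying one once it is known. It does not implement the iterative estimation algorithm from the paper. It assumes a linear cross-talk model, observed = M × concentrations, and the entries of `CROSS_TALK` are made-up values, not estimates from any instrument.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

# Hypothetical cross-talk matrix: entry [i][j] is the response of
# detection channel i to a unit concentration of dye j.
CROSS_TALK = [
    [1.00, 0.30, 0.05, 0.00],
    [0.20, 1.00, 0.25, 0.05],
    [0.05, 0.20, 1.00, 0.30],
    [0.00, 0.05, 0.15, 1.00],
]

def mix(concentrations):
    """Forward model: dye concentrations to observed channel intensities."""
    return [sum(m * c for m, c in zip(row, concentrations)) for row in CROSS_TALK]

def color_separate(observed):
    """Recover dye concentrations from one 4-channel intensity vector."""
    return solve_linear(CROSS_TALK, observed)
```

Applying `color_separate` to every time point of a raw trace would give the separated trace; in practice the hard part is estimating the matrix, which is the subject of the paper.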


Recovering four dye concentrations from light intensities in three wavelength bands in DNA sequencing

In standard four-dye fluorescence-based DNA sequencing, data on the intensity of light emitted by four laser-excited fluorescent dyes are collected with a detection system at four different wavelengths, to permit estimation of the concentration of each dye in the detection region at the time of excitation. This paper addresses the problem of estimating four dye concentrations from intensities detected at only three wavelengths. Two methods are proposed. The first attempts to solve the under-determined linear system by elucidating the pattern of aliasing embodied in the cross-talk matrix, while the second combines non-negative least squares with a series of model selections. Both attempt to deal with statistical and chemical features of the problem. The two methods are illustrated on two sets of DNA sequencing data generated by capillary electrophoresis in the lab of Professor R. Mathies, Department of Chemistry, University of California at Berkeley.
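
The flavor of the second approach, non-negative fitting combined with model selection, can be hinted at with a deliberately simplified sketch. It is not the method from the paper: it assumes that only one dye is present at a given time point, fits a non-negative amplitude for each candidate dye, and keeps the candidate with the smallest residual. The 3×4 matrix `CROSS_TALK_3X4` and the function names are hypothetical.

```python
# Hypothetical 3-by-4 cross-talk matrix: three detection bands, four dyes.
CROSS_TALK_3X4 = [
    [1.00, 0.40, 0.05, 0.00],
    [0.10, 1.00, 0.50, 0.10],
    [0.00, 0.10, 0.60, 1.00],
]

def fit_single_dye(observed, column):
    """Non-negative least-squares amplitude for one dye, plus the residual."""
    scale = sum(c * c for c in column)
    amplitude = max(0.0, sum(o * c for o, c in zip(observed, column)) / scale)
    rss = sum((o - amplitude * c) ** 2 for o, c in zip(observed, column))
    return amplitude, rss

def best_single_dye(observed, matrix=CROSS_TALK_3X4):
    """Pick the single-dye model with the smallest residual sum of squares.

    Assumption: exactly one dye contributes at this time point.
    """
    fits = []
    for dye in range(len(matrix[0])):
        column = [row[dye] for row in matrix]
        amplitude, rss = fit_single_dye(observed, column)
        fits.append((rss, dye, amplitude))
    rss, dye, amplitude = min(fits)
    return dye, amplitude, rss
```

With three bands and four dyes the general problem has no unique solution, so some restriction of this kind (here, a crude one-dye-at-a-time assumption) is needed before a concentration vector can be singled out.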