Base-calling is a central part of any large-scale genomic
sequencing effort.
It takes a vector time series such as the one above (here are
two segments)
and produces an estimate
of the underlying DNA sequence that gave rise to the signal.
My dissertation approaches
the problem of base-calling from a statistical perspective,
aiming to use a statistical model both to call bases and
to attach suitable measures of uncertainty to the bases we call.
The research is conducted under my advisor, Terry
Speed. Other people in this group are Dave Nelson and Simon
Cawley.
We are trying to mimic the DNA sequencing procedure with this
statistical model. That is, given a DNA sequence, we can generate from
the model a virtual sequencing trace signal that closely resembles
the real data. Here is one
example.
More examples of simulated base-calling can be found on Simon's
homepage.
My current research on DNA base-calling mainly concerns three
topics: hidden Markov models, deconvolution, and color
separation.
Five manuscripts have arisen from this research. My
dissertation will be a collection of these manuscripts. More details are
given by the following
titles and summaries.
In this paper, we describe a new base-calling algorithm for use with automatic fluorescence-based DNA sequencing instruments. Its main steps consist of color separation, deconvolution, and the use of a hidden Markov model (HMM) decoder. The core of the algorithm is the HMM, which has a finite number of hidden states and mixtures of Gaussians for outputs. This model is designed to take into account variations in the amplitude and spacing of the peaks in the sequencing trace, at the same time as permitting the use of efficient algorithms for training and the reconstruction of the hidden states. Since this reconstruction is probabilistic in nature, a useful by-product is the probabilistic assessment of alternative base calls.
You can find the
topology graph of our hidden Markov model here.
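To make the decoding step concrete, here is a minimal sketch of Viterbi decoding for an HMM with Gaussian-mixture outputs. It is a toy illustration, not the model from the paper: the two states ("gap" and "peak"), the scalar observations, and every probability and mixture parameter below are assumptions made up for the example.

```python
import math


def log_gauss(x, mean, sd):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_mixture(x, components):
    """Log density of a Gaussian mixture given as (weight, mean, sd) triples."""
    terms = [math.log(w) + log_gauss(x, m, s) for w, m, s in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def viterbi(obs, states, log_start, log_trans, emissions):
    """Return the most probable hidden-state path for the observations."""
    score = {s: log_start[s] + log_mixture(obs[0], emissions[s]) for s in states}
    back = []
    for x in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            pointers[s] = best
            score[s] = prev[best] + log_trans[best][s] + log_mixture(x, emissions[s])
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: all numbers below are illustrative assumptions.
states = ["gap", "peak"]
log_start = {"gap": math.log(0.5), "peak": math.log(0.5)}
log_trans = {
    "gap": {"gap": math.log(0.7), "peak": math.log(0.3)},
    "peak": {"gap": math.log(0.5), "peak": math.log(0.5)},
}
emissions = {
    "gap": [(1.0, 0.0, 0.3)],
    "peak": [(0.5, 1.0, 0.3), (0.5, 2.0, 0.5)],
}
trace = [0.1, 0.0, 1.1, 1.9, 0.2, -0.1, 1.0]
path = viterbi(trace, states, log_start, log_trans, emissions)
print(path)
```

The two-component mixture for the "peak" state is one simple way to allow for varying peak amplitudes. Replacing the maximization with a sum (the forward-backward recursions) would give posterior state probabilities, which is the kind of quantity needed to assess alternative base calls.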
This paper describes a deconvolution method appropriate for a particular class of signals, namely what we term spike-deconvolution models. These models arise when a sparse spike train is convolved with a fixed point-spread function, and additive noise or measurement error is superimposed. In this context we view deconvolution as an estimation problem, regarding the positions and amplitudes of the underlying spikes, as well as a background and noise parameter as unknowns. Our estimation scheme starts with a set of trigonometric moment estimates of positions and amplitudes, typically more than will ultimately be recognized, and then obtains the final parameters using a combination of backwards deletion and the BIC model selection criterion. We also present results on the spectral structure of Toeplitz matrices which play a role in the estimation. Finally, we present simulations to show how our estimates perform.
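The spike-deconvolution model and the role of BIC can be illustrated with a small sketch. It only simulates the forward model (sparse spikes convolved with a point-spread function, plus noise) and scores a few fixed candidate spike sets by BIC. It does not implement the trigonometric moment estimates or the backwards deletion described above, and the Gaussian point-spread function, the noise level, the parameter count, and the candidate sets are all assumptions chosen for the example.

```python
import math
import random


def gaussian_psf(width, half_len):
    """Sampled Gaussian point-spread function (an assumed shape)."""
    return [math.exp(-0.5 * (k / width) ** 2) for k in range(-half_len, half_len + 1)]


def synthesize(spikes, psf, n, background=0.0):
    """Convolve a sparse spike train {position: amplitude} with the PSF."""
    half = len(psf) // 2
    signal = [background] * n
    for pos, amp in spikes.items():
        for k, h in enumerate(psf):
            t = pos + k - half
            if 0 <= t < n:
                signal[t] += amp * h
    return signal


def bic(observed, fitted, n_params):
    """BIC under Gaussian noise: n*log(RSS/n) + n_params*log(n)."""
    n = len(observed)
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return n * math.log(rss / n) + n_params * math.log(n)


random.seed(0)
n = 60
psf = gaussian_psf(width=2.0, half_len=6)
truth = {20: 1.0, 35: 0.6}  # hypothetical spike positions and amplitudes
observed = [s + random.gauss(0.0, 0.05) for s in synthesize(truth, psf, n)]

# Candidate spike sets are fixed here rather than re-estimated from the data.
candidates = {
    "one spike": {20: 1.0},
    "two spikes": {20: 1.0, 35: 0.6},
    "three spikes": {20: 1.0, 35: 0.6, 48: 0.02},
}
# Two parameters per spike (position, amplitude) plus background and noise.
scores = {
    name: bic(observed, synthesize(spikes, psf, n), 2 * len(spikes) + 2)
    for name, spikes in candidates.items()
}
best = min(scores, key=scores.get)
print(best)
```

In this setup the penalty term is what discourages keeping the tiny third spike, while dropping a real spike inflates the residual sum of squares far more than the penalty saves.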
In this paper we compare several deconvolution techniques, including those minimizing a convex functional subject to nonnegativity constraints on the unknown function, the maximum entropy deconvolution, the technique introduced by Jansson, and a parametric approach we have recently described. Our focus is on the ease with which an algorithm can be implemented automatically, and on aspects of the reconstructed function, such as having the correct number of peaks. We describe the methods, make some general comments about their implementation, and compare them on a segment from an electrophoretic trace obtained from a DNA sequencing instrument.
Here are some results.
Color separation is an essential step in the analysis of intensity data collected on automated DNA sequencing instruments using a four-dye strategy. We begin the paper by describing a graphical approach to displaying the phenomenon of cross-talk. The display suggests a natural estimate of the color separation matrix, and we present an iterative algorithm which produces such an estimate. Each loop of the algorithm has two steps: a first step which samples typical points from the data, and a second, regression step which produces estimates of the parameters. The algorithm is illustrated on a sequencing trace obtained from a slab gel constructed at the Human Genome Center at LBNL.
If you are interested, look at these graphical displays: data before color separation, and data after color separation.
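Once a cross-talk matrix is available, color separation itself amounts to solving a small linear system at each time point. The sketch below shows only that final step. The 4x4 matrix `CROSS_TALK` is a made-up example (rows are detection channels, columns are dyes), not an estimate from real data, and the iterative sampling and regression procedure that estimates the matrix is not implemented here.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# Hypothetical cross-talk matrix: entry [i][j] is the response of
# detection channel i to a unit concentration of dye j.
CROSS_TALK = [
    [1.0, 0.4, 0.1, 0.0],
    [0.3, 1.0, 0.4, 0.1],
    [0.1, 0.3, 1.0, 0.4],
    [0.0, 0.1, 0.3, 1.0],
]


def mix(concentrations):
    """Forward model: observed channel intensities for given dye concentrations."""
    return [sum(m * c for m, c in zip(row, concentrations)) for row in CROSS_TALK]


def separate(observed):
    """Color separation: recover dye concentrations from channel intensities."""
    return solve(CROSS_TALK, observed)


true_conc = [0.0, 2.0, 0.0, 0.5]
observed = mix(true_conc)
recovered = separate(observed)
print(recovered)
```

Applying `separate` to every time point of a four-channel trace gives the color-separated signal, up to the accuracy of the assumed matrix.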
In standard four-dye fluorescence-based DNA sequencing,
data on the intensity of light emitted by four laser-excited
fluorescent dyes are collected with a detection system at four
different wavelengths, to permit estimation of the concentration of
each dye in the detection region at the time of excitation. This paper
addresses the problem of estimating four dye concentrations from
intensities detected at only three wavelengths. Two methods are
proposed. The first attempts to solve the under-determined linear system
by elucidating the pattern of aliasing embodied in the cross-talk
matrix, while the second combines non-negative least squares with a
series of model selections. Both attempt to deal with statistical and
chemical features of the problem. The two methods are illustrated on
two sets of DNA sequencing data generated by capillary electrophoresis
in the lab of Professor R. Mathies, Department of Chemistry, University
of California at Berkeley.
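The idea of combining non-negative least squares with model selection can be sketched as follows. This is a simplified brute-force stand-in, not the procedure from the paper: it assumes that at most two dyes are present at any time point, tries every one- and two-dye subset, discards fits with negative amplitudes, and picks the subset with the smallest penalized residual. The 3x4 matrix `MIX` and the `penalty` value are hypothetical.

```python
from itertools import combinations

# Hypothetical 3x4 cross-talk matrix: rows are wavelengths, columns are dyes.
MIX = [
    [1.0, 0.5, 0.1, 0.0],
    [0.2, 1.0, 0.6, 0.2],
    [0.0, 0.2, 0.7, 1.0],
]


def column(j):
    return [row[j] for row in MIX]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_subset(y, dyes):
    """Least-squares amplitudes for one or two dyes; None if any is negative."""
    cols = [column(j) for j in dyes]
    if len(cols) == 1:
        amps = [dot(cols[0], y) / dot(cols[0], cols[0])]
    else:
        # Two-column normal equations solved by Cramer's rule.
        a, b = cols
        aa, ab, bb = dot(a, a), dot(a, b), dot(b, b)
        ay, by = dot(a, y), dot(b, y)
        det = aa * bb - ab * ab
        amps = [(ay * bb - by * ab) / det, (by * aa - ay * ab) / det]
    if min(amps) < 0:
        return None
    fitted = [sum(x * c[i] for x, c in zip(amps, cols)) for i in range(len(y))]
    rss = sum((o - f) ** 2 for o, f in zip(y, fitted))
    return amps, rss


def estimate(y, penalty=0.01):
    """Pick the dye subset minimizing RSS + penalty * subset size."""
    best = None
    for size in (1, 2):
        for dyes in combinations(range(4), size):
            fit = fit_subset(y, dyes)
            if fit is None:
                continue
            amps, rss = fit
            cost = rss + penalty * size
            if best is None or cost < best[0]:
                best = (cost, dyes, amps)
    conc = [0.0] * 4
    if best is not None:
        for j, x in zip(best[1], best[2]):
            conc[j] = x
    return conc


# Intensities at three wavelengths produced by 1.5 units of the third dye alone.
single = estimate([0.15, 0.9, 1.05])
# Intensities produced by 0.8 units of the first dye and 1.2 of the fourth.
pair = estimate([0.8, 0.4, 1.2])
print(single, pair)
```

Restricting the search to small subsets is what makes the three-wavelength problem solvable in this sketch: with all four dyes free, three measurements cannot determine the concentrations uniquely.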