Department of Mathematics,
University of California San Diego

****************************

Bioinformatics Colloquium

Chiara Sabatti

University of California, Los Angeles

Genomewide motif recognition with a dictionary model

Abstract:

Bussemaker et al. (2000, PNAS) proposed the simple idea of modeling noncoding DNA sequence as a concatenation of words and gave an algorithm to reconstruct deterministic words from an observed sequence. Starting from the same premises, we consider words that can be spelled in a variety of forms, which accounts for varying degrees of conservation of the same motif across genome locations. These "words" correspond to binding sites of regulatory proteins. The overall frequency of occurrence of each word in the sequence and the parameters describing the random spelling of words are estimated in a maximum-likelihood framework using an EM gradient algorithm. Once these parameters are estimated, it is possible to evaluate the probability with which each motif occurs at a given location in the sequence. These conditional probabilities can be used to predict which genes are subject to similar transcriptional regulation. Gene expression data can be used to validate and refine such predictions.

Host: Ian Abramson

November 14, 2002

3:00 PM

AP&M 6438

****************************