A reexamination of information theory-based methods for DNA-binding site identification

dc.contributor.author: Erill, Ivan
dc.contributor.author: O'Neill, Michael C
dc.date.accessioned: 2021-03-09T16:48:51Z
dc.date.available: 2021-03-09T16:48:51Z
dc.date.issued: 2009-02-11
dc.description.abstract: Background: Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods. Results: Our results reveal that conventional benchmarking against artificial sequence data leads frequently to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions on information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results. Conclusion: We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than any other alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution. [en_US]
dc.description.sponsorship: The authors wish to thank Andrew Cameron and Rosie Redfield for kindly providing the sequences of CRP sites of H. influenzae. This work was supported partly by UMBC Special Research Assistantship/Initiative Support program. The authors would like to thank the reviewers of this manuscript for their insightful comments and suggestions. [en_US]
dc.description.uri: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-57 [en_US]
dc.format.extent: 22 pages [en_US]
dc.genre: journal articles [en_US]
dc.identifier: doi:10.13016/m2ji8n-ipk1
dc.identifier.citation: Erill, I., O'Neill, M.C. A reexamination of information theory-based methods for DNA-binding site identification. BMC Bioinformatics 10, 57 (2009). https://doi.org/10.1186/1471-2105-10-57 [en_US]
dc.identifier.uri: https://doi.org/10.1186/1471-2105-10-57
dc.identifier.uri: http://hdl.handle.net/11603/21110
dc.language.iso: en_US [en_US]
dc.publisher: BMC [en_US]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Biological Sciences Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution 2.0 Generic [*]
dc.rights.uri: https://creativecommons.org/licenses/by/2.0/ [*]
dc.title: A reexamination of information theory-based methods for DNA-binding site identification [en_US]
dc.type: Text [en_US]

Files

Original bundle

Name: 1471-2105-10-57.pdf
Size: 674.08 KB
Format: Adobe Portable Document Format

Name: 12859_2008_2787_MOESM1_ESM.pdf
Size: 143.03 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission