Parametric bootstrapping for biological sequence motifs

dc.contributor.author: O’Neill, Patrick K.
dc.contributor.author: Erill, Ivan
dc.date.accessioned: 2021-03-05T19:53:47Z
dc.date.available: 2021-03-05T19:53:47Z
dc.date.issued: 2016-10-06
dc.description.abstract: Background: Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Accordingly, the effective computational discovery and analysis of such motifs is a central theme in bioinformatics. Many practical questions about the properties of motifs can be recast as random sampling problems. In this light, the task is to determine for a given motif whether a certain feature of interest is statistically unusual among relevantly similar alternatives. Despite the generality of this framework, its use has been frustrated by the difficulties of defining an appropriate reference class of motifs for comparison and of sampling from it effectively. Results: We define two distributions over the space of all motifs of given dimension. The first is the maximum entropy distribution subject to mean information content, and the second is the truncated uniform distribution over all motifs having information content within a given interval. We derive exact sampling algorithms for each. As a proof of concept, we employ these sampling methods to analyze a broad collection of prokaryotic and eukaryotic transcription factor binding site motifs. In addition to positional information content, we consider the informational Gini coefficient (IGC) of the motif, a measure of the degree to which information is evenly distributed throughout a motif’s positions. We find that both prokaryotic and eukaryotic motifs tend to exhibit higher IGC than would be expected by chance under either reference distribution. As a second application, we apply maximum entropy sampling to the motif p-value problem and use it to give elementary derivations of two new estimators. Conclusions: Despite the historical centrality of biological sequence motif analysis, this study constitutes to our knowledge the first use of principled null hypotheses for sequence motifs given information content. Through their use, we are able to characterize for the first time differences in global motif statistics between biological motifs and their null distributions. In particular, we observe that biological sequence motifs show an unusual distribution of IGC, presumably due to biochemical constraints on the mechanisms of direct read-out.
dc.description.sponsorship: The authors wish to thank Sefa Kılıç for assistance with data collection and motif structure detection, Rory Donovan for advice on a preliminary study that led to the present work, and Lies Boelen for many helpful discussions. The authors were supported by a grant from the US National Science Foundation (MCB-1158056). Publication costs were also defrayed by the US National Science Foundation (MCB-1158056). The US National Science Foundation had no role in the design of the study; collection, analysis, and interpretation of data; nor in writing the manuscript.
dc.description.uri: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1246-8
dc.format.extent: 18 pages
dc.genre: journal articles
dc.identifier: doi:10.13016/m2rh6o-d19x
dc.identifier.citation: O’Neill, P.K., Erill, I. Parametric bootstrapping for biological sequence motifs. BMC Bioinformatics 17, 406 (2016). https://doi.org/10.1186/s12859-016-1246-8
dc.identifier.uri: https://doi.org/10.1186/s12859-016-1246-8
dc.identifier.uri: http://hdl.handle.net/11603/21085
dc.language.iso: en_US
dc.publisher: BMC
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Biological Sciences Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Parametric bootstrapping for biological sequence motifs
dc.type: Text

Files

Original bundle
Name: s12859-016-1246-8.pdf
Size: 1.19 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission