Optimization of Computations Used in Information Theory Applied to Base Pair Analysis

Coates, Andrew; Ilchenko, Alexey; O’Neill, Patrick; Erill, Ivan

Optimization of Computations Used in Information Theory Applied to Base Pair Analysis

dc.contributor.advisor	Gobbert, Matthias K.
dc.contributor.advisor	Neerchal, Nagaraj K.
dc.contributor.author	Coates, Andrew
dc.contributor.author	Ilchenko, Alexey
dc.contributor.author	O’Neill, Patrick
dc.contributor.author	Erill, Ivan
dc.date.accessioned	2018-10-24T19:16:48Z
dc.date.available	2018-10-24T19:16:48Z
dc.date.issued	2011
dc.description.abstract	Information theory is used in biology as an approach to analyzing and discovering enzyme binding sites in DNA. By using entropy, a measure of uncertainty, to analyze samples, information can be gained to make predictions more certain, or less uncertain. Information is gained by measuring how uncertain we expect to be at a site and deduct the uncertainty from observing the site. To get the desired accuracy needed for these predictions, one has to do costly computations. When there are hundreds of samples, values can be approximated sufficiently accurately. When there are very few samples, values can be calculated by exhaustive enumeration. In the intermediate range, however, it’s not clear that the existing approximations are optimal and the original method of exhaustive enumeration is very costly with regards to time and memory. The better these predictions are the better biologists can predict where enzymes will bind when performing expensive wet lab experiments. By predicting where the enzymes bind, they are predicting what function the protein will take. This work develops an algorithm for the intermediate range of sample sizes that is much faster and uses less memory in the process than the original method. The original naive computation is of exponential time complexity with respect to the number of samples. The new algorithm uses number theory to reduce the time complexity to cubic and further simplify the computation by using assumptions from biology. This is a substantial speedup in all ranges of sample sizes, born out by comparing Python, MATLAB, and serial C code. Computational experiments that parallelize the studies using the Intel Concurrent Collections (CnC) programming paradigm moreover show that tools from parallel computing can be useful to make larger parameter studies feasible in reasonable time.	en
dc.description.sponsorship	Interdisciplinary Program in High Performance Computing (www.umbc.edu/hpcreu) in the UMBC Department of Mathematics and Statistics, funded by the National Science Foundation (grant no. DMS–0851749). This program is also supported by UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF). The computational hardware in HPCF (www.umbc.edu/hpcf) is partially funded by the National Science Foundation through the MRI program (grant no. CNS–0821258) and the SCREMS program (grant no. DMS–0821311), with additional substantial support from UMBC.	en
dc.description.uri	https://userpages.umbc.edu/~gobbert/papers/REU2011Team3.pdf	en
dc.format.extent	9 pages	en
dc.genre	technical report	en
dc.identifier	doi:10.13016/M2PZ51Q7M
dc.identifier.uri	http://hdl.handle.net/11603/11675
dc.language.iso	en	en
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Mathematics Department Collection
dc.relation.ispartof	UMBC Faculty Collection
dc.relation.ispartof	UMBC Student Collection
dc.relation.ispartof	UMBC Biological Sciences Department
dc.relation.ispartofseries	HPCF Technical Report;HPCF–2011–13
dc.rights	This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject	entropy
dc.subject	predicting where enzymes will bind
dc.subject	Optimization	en
dc.subject	Information Theory	en
dc.subject	Base Pair Analysis	en
dc.subject	Concurrent Collections (CnC)	en
dc.subject	UMBC High Performance Computing Facility (HPCF)	en
dc.title	Optimization of Computations Used in Information Theory Applied to Base Pair Analysis	en
dc.type	Text	en

Files

Original bundle

Now showing 1 - 2 of 2

Name:: REU2011Team3.pdf
Size:: 122.55 KB
Format:: Adobe Portable Document Format
Description:

Download

Name:: REU2011Team3.zip
Size:: 4.62 KB
Format:: Unknown data format
Description:

Download

License bundle

Now showing 1 - 1 of 1

Name:: license.txt
Size:: 1.68 KB
Format:: Item-specific license agreed upon to submission
Description:

Download

Collections

UMBC Mathematics and Statistics Department
UMBC Biological Sciences Department
UMBC Faculty Collection
UMBC Student Collection