Optimization of Computations Used in Information Theory Applied to Base Pair Analysis

dc.contributor.advisorGobbert, Matthias K.
dc.contributor.advisorNeerchal, Nagaraj K.
dc.contributor.authorCoates, Andrew
dc.contributor.authorIlchenko, Alexey
dc.contributor.authorO’Neill, Patrick
dc.contributor.authorErill, Ivan
dc.date.accessioned2018-10-24T19:16:48Z
dc.date.available2018-10-24T19:16:48Z
dc.date.issued2011
dc.description.abstractInformation theory is used in biology as an approach to analyzing and discovering enzyme binding sites in DNA. By using entropy, a measure of uncertainty, to analyze samples, information can be gained to make predictions more certain, or less uncertain. Information is gained by measuring how uncertain we expect to be at a site and deduct the uncertainty from observing the site. To get the desired accuracy needed for these predictions, one has to do costly computations. When there are hundreds of samples, values can be approximated sufficiently accurately. When there are very few samples, values can be calculated by exhaustive enumeration. In the intermediate range, however, it’s not clear that the existing approximations are optimal and the original method of exhaustive enumeration is very costly with regards to time and memory. The better these predictions are the better biologists can predict where enzymes will bind when performing expensive wet lab experiments. By predicting where the enzymes bind, they are predicting what function the protein will take. This work develops an algorithm for the intermediate range of sample sizes that is much faster and uses less memory in the process than the original method. The original naive computation is of exponential time complexity with respect to the number of samples. The new algorithm uses number theory to reduce the time complexity to cubic and further simplify the computation by using assumptions from biology. This is a substantial speedup in all ranges of sample sizes, born out by comparing Python, MATLAB, and serial C code. Computational experiments that parallelize the studies using the Intel Concurrent Collections (CnC) programming paradigm moreover show that tools from parallel computing can be useful to make larger parameter studies feasible in reasonable time.en_US
dc.description.sponsorshipInterdisciplinary Program in High Performance Computing (www.umbc.edu/hpcreu) in the UMBC Department of Mathematics and Statistics, funded by the National Science Foundation (grant no. DMS–0851749). This program is also supported by UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF). The computational hardware in HPCF (www.umbc.edu/hpcf) is partially funded by the National Science Foundation through the MRI program (grant no. CNS–0821258) and the SCREMS program (grant no. DMS–0821311), with additional substantial support from UMBC.en_US
dc.description.urihttps://userpages.umbc.edu/~gobbert/papers/REU2011Team3.pdfen_US
dc.format.extent9 pagesen_US
dc.genretechnical reporten_US
dc.identifierdoi:10.13016/M2PZ51Q7M
dc.identifier.urihttp://hdl.handle.net/11603/11675
dc.language.isoen_USen_US
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Mathematics Department Collection
dc.relation.ispartofUMBC Faculty Collection
dc.relation.ispartofUMBC Student Collection
dc.relation.ispartofUMBC Biological Sciences Department
dc.relation.ispartofseriesHPCF Technical Report;HPCF–2011–13
dc.rightsThis item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subjectOptimizationen_US
dc.subjectInformation Theoryen_US
dc.subjectBase Pair Analysisen_US
dc.subjectConcurrent Collections (CnC)en_US
dc.subjectUMBC High Performance Computing Facility (HPCF)en_US
dc.subjectentropy
dc.subjectpredicting where enzymes will bind
dc.titleOptimization of Computations Used in Information Theory Applied to Base Pair Analysisen_US
dc.typeTexten_US

Files

Original bundle

Now showing 1 - 2 of 2
Loading...
Thumbnail Image
Name:
REU2011Team3.pdf
Size:
122.55 KB
Format:
Adobe Portable Document Format
Description:
No Thumbnail Available
Name:
REU2011Team3.zip
Size:
4.62 KB
Format:
Unknown data format
Description:

License bundle

Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
1.68 KB
Format:
Item-specific license agreed upon to submission
Description: