Biological Sequence Analysis Using Hadoop/Mapreduce As A Distributed Computing Model

Author/Creator

Author/Creator ORCID

Date

2012

Type of Work

Department

Computer Science and Bioinformatics Program

Program

Master of Science

Citation of Original Publication

Rights

This item is made available by Morgan State University for personal, educational, and research purposes in accordance with Title 17 of the U.S. Copyright Law. Other uses may require permission from the copyright owner.

Abstract

Most biological (DNA, RNA, or protein) sequence analysis algorithms are complex and require extensive execution time and memory. Serial biological sequence processing algorithms do not use the computing power of present computers efficiently. Today, researchers and scientists have developed and tested many programming models for parallelizing and optimizing algorithms to decrease execution time and memory use. MapReduce is a programming model based on functional programming, in which users implement an interface of two functions, map and reduce. In general, map is an application of a function to each input element, and reduce is the aggregation of the results of those applications. The MapReduce programming model is patented by Google. In this research, the Hadoop implementation of MapReduce was used. Hadoop and the Hadoop Distributed File System are open-source implementations of MapReduce and the Google File System. The Hadoop framework automatically transforms map and reduce applications into map and reduce tasks. All known biological sequences and their functional annotations are stored in biological databases. A newly determined biological sequence should be compared with each and every known corresponding biological sequence to detect potential structural or evolutionary relationships. From a computational point of view, a major challenge is to align the query biological sequence against a very large collection of biological sequences and sort them according to the score of their alignment with the input sequence. The solution has to be fast and scalable. The main goals of this thesis research are:
* To build a fully distributed Ubuntu Hadoop cluster of four nodes.
* To configure and test the Hadoop cluster on the LittleFe cluster computer.
* To measure the efficiency of the program in terms of execution time and memory used.
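The map/reduce division of labor described above can be illustrated with a minimal plain-Java sketch that counts k-mers (length-k substrings) in a DNA sequence: the "map" step emits one record per overlapping k-mer, and the "reduce" step aggregates the emitted records into per-key counts. This is a single-machine illustration of the model only, not the Hadoop API; the class and method names are hypothetical.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java illustration of the map/reduce idea (not the Hadoop API):
// "map" emits one record per overlapping k-mer of a DNA sequence,
// "reduce" aggregates the emitted k-mers into per-key counts.
public class KmerCount {
    // map step: emit every overlapping k-mer of the sequence
    static List<String> map(String dna, int k) {
        List<String> kmers = new ArrayList<>();
        for (int i = 0; i + k <= dna.length(); i++) {
            kmers.add(dna.substring(i, i + k));
        }
        return kmers;
    }

    // reduce step: group identical k-mers and count each group
    static Map<String, Long> reduce(List<String> kmers) {
        return kmers.stream()
                    .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
    }

    public static void main(String[] args) {
        // "ACGTACGT" contains the 3-mers ACG, CGT, GTA, TAC, ACG, CGT
        Map<String, Long> counts = reduce(map("ACGTACGT", 3));
        System.out.println(counts);
    }
}
```

In Hadoop the same two steps run as distributed map and reduce tasks over file splits, with the framework handling the grouping (shuffle) between them.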
The main achievements/results of this thesis research are:
* Transformation of the LittleFe cluster computer from the BCCD operating system to the Ubuntu operating system.
* Modification of two Hadoop examples, RandomTextWriter.java and SecondarySort.java, into the Hadoop MRGenerateDNA.java program, which generates a big file of random DNA sequences, and the Hadoop MRSortDNA.java program, which sorts DNA sequences, respectively.
* Demonstration that Hadoop is an efficient programming model for developing new parallel algorithms for biological sequence processing based on the MapReduce programming model.
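The generate-then-sort pipeline that MRGenerateDNA.java and MRSortDNA.java implement as distributed Hadoop jobs can be sketched on a single machine as follows: draw random sequences over the alphabet {A, C, G, T}, then sort them lexicographically. This is a hedged, non-distributed sketch; the class and method names below are illustrative and do not appear in the thesis programs.

```java
import java.util.*;

// Single-machine sketch of the thesis's two-job pipeline (illustrative names):
// generate random DNA sequences over {A, C, G, T}, then sort them.
// The actual MRGenerateDNA.java and MRSortDNA.java run as Hadoop MapReduce jobs.
public class DnaSketch {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // generate one random DNA sequence of the given length
    static String randomSequence(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(BASES[rng.nextInt(BASES.length)]);
        }
        return sb.toString();
    }

    // generate n sequences from a seeded RNG, then sort them lexicographically
    static List<String> generateAndSort(long seed, int n, int length) {
        Random rng = new Random(seed);
        List<String> seqs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            seqs.add(randomSequence(rng, length));
        }
        Collections.sort(seqs);
        return seqs;
    }

    public static void main(String[] args) {
        for (String s : generateAndSort(42L, 5, 10)) {
            System.out.println(s);
        }
    }
}
```

In the Hadoop versions, generation is parallelized across map tasks writing to HDFS, and sorting exploits the framework's shuffle phase, which delivers keys to reducers in sorted order.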