Detection of near duplicate threads on online question & answer forums.






Computer Science and Electrical Engineering


Computer Science



Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.



The number of questions asked on question and answer (Q&A) forums such as Stack Overflow, Quora, and Twitter is increasing rapidly. Millions of users visit these sites each month to post questions, and, unsurprisingly, many of those questions are duplicates. Users may wait a long time for an answer even though a related question has already been answered, so an automatic way of identifying duplicate threads is important. On Stack Overflow, users with high reputation flag questions as duplicates, and the flagged questions are forwarded to moderators, who decide whether each one is a duplicate. Quora, on the other hand, uses a Random Forest model to identify duplicate questions. In this research, we built a machine learning (ML) model that combines word2vec embeddings from Gensim, pretrained on Google's news dataset (a vocabulary of 3 million words and phrases), with Long Short-Term Memory networks (LSTMs), a deep learning technique. The trained model predicts duplicate threads with an accuracy of 84.15% in our experiments and outperforms the traditional machine learning models in both accuracy and speed. This model will make it easier to find high-quality answers to questions, improving the experience for Q&A writers, seekers, and readers.