Detection of near duplicate threads on online question & answer forums.

Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless it is under a Creative Commons license, contact the copyright holder or the author for uses protected by Copyright Law.

Abstract

The number of questions asked on question and answer (Q&A) forums such as Stack Overflow, Quora, and Twitter is increasing rapidly. Millions of users visit these sites each month and post their questions, and, unsurprisingly, many of these questions are duplicates. Users may wait a long time for answers even though related questions have already been answered, so an automatic way of identifying duplicate threads is important. On Stack Overflow, users with higher reputations mark questions as duplicates; these are then forwarded to moderators, who decide whether each question is in fact a duplicate. Quora, on the other hand, uses a Random Forest model to identify duplicate questions. In this research, we built a machine learning (ML) model that combines word2vec embeddings, loaded through Gensim and pretrained on the Google News dataset (a vocabulary of 3 million words and phrases), with Long Short-Term Memory networks (LSTMs), a deep learning technique. The trained model predicts duplicate threads with an accuracy of 84.15% in our experiments and outperforms traditional machine learning models in both accuracy and speed. This model will make it easier to find high-quality answers to questions, resulting in an improved experience for Q&A writers, seekers, and readers.
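
The idea of comparing two questions through their word embeddings can be illustrated with a minimal sketch. This is not the model described in the abstract: the LSTM encoder is replaced here by simple vector averaging, and the comparison is a plain cosine-similarity threshold. The embedding table `TOY_VECTORS`, its values, the function names, and the threshold of 0.9 are all hypothetical and chosen only for illustration; a real system would load pretrained word2vec vectors instead.

```python
import math

# Hypothetical 3-dimensional embeddings with made-up values.
# They stand in for pretrained word2vec vectors, which are far larger.
TOY_VECTORS = {
    "how": [0.1, 0.3, 0.0],
    "sort": [0.9, 0.1, 0.2],
    "order": [0.8, 0.2, 0.1],
    "list": [0.2, 0.9, 0.1],
    "array": [0.3, 0.8, 0.3],
    "python": [0.1, 0.2, 0.9],
    "install": [0.0, 0.1, 0.3],
}
DIMENSIONS = 3


def embed(question):
    """Average the vectors of known words; unknown words are skipped."""
    words = question.lower().split()
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return [0.0] * DIMENSIONS
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_duplicate(question_a, question_b, threshold=0.9):
    """Flag a pair as a likely duplicate when similarity reaches the threshold.

    The threshold is an assumed value, not one taken from the thesis.
    """
    return cosine(embed(question_a), embed(question_b)) >= threshold


if __name__ == "__main__":
    print(is_duplicate("how sort list python", "how order array python"))
    print(is_duplicate("install python", "sort list"))
```

Averaging discards word order, which is one reason a sequence model such as an LSTM is a natural choice for the encoder in this kind of task.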
