Beyond Social Structure: Predicting Influence in Online Review Communities

Date

2019-10-30

Department

Business and Management

Program

Doctor of Philosophy

Abstract

The findings of this dissertation give businesses and online review communities (ORCs) ways to predict influence within these communities. To my knowledge, there is little research on the prediction of influence, and few studies of influence use ORCs as a basis of investigation. This dissertation seeks to fill that gap by investigating the prediction of influence in an ORC and generating predictive models for this type of online social network (OSN), specifically Yelp. Emphasizing the differences between an ORC and a traditional OSN, I draw on past research, the susceptible-infective-removed (SIR) model, the threshold model, the cascade model, expectation states theory, social influence and status characteristics theory, self-categorization theory, and the concepts of homophily and reciprocity to ground a discussion of modifications to traditional measurements of influence for ORCs. The different measurements are then compared, and the results are used to create predictive models using supervised machine learning algorithms. Two separate investigations are conducted. The first examines determinants for predicting influence, based on the differences associated with ORCs. Influence for the Yelp dataset is operationalized as a change in friends and a change in votes from one time period to another. The second generates models that predict influence. In comparing the determinants, three levels are examined: the member, the member’s network, and the member along with the member’s network. Based on the results of the first investigation, the member level and the member along with the member’s network level are examined in the second investigation.
In both investigations, members of the Yelp community are separated into four groups, based on their time since joining the site, to see whether the groups differ. The receiver operating characteristic (ROC) graph, with its accompanying area under the curve (AUC), is used in both investigations to assess the performance of classifiers. To identify the best determinant of future influence, I compare the AUC values obtained when each proposed determinant (friends count, votes count, and review count) is used to predict the dependent variable, future influence. The results indicate that review count is the best predictor of influence, whether the dependent variable is a change in friends or a change in votes. To identify the best level of measurement, I compare the AUC values across the levels of each determinant. For a change in friends, the member along with the member’s network is the best level of measurement for all groups, regardless of the predictor (friends count, votes count, or review count). For a change in votes, the member along with the member’s network is the best level of measurement (for the predictors votes count and review count) for members who have been part of Yelp’s community for one month, while the member is the best level of measurement (for the same predictors) for groups in which members have been part of Yelp’s community for more than one month. The friends count predictor does not have a member level. Thus, for this predictor, the member’s network is the same as the member along with the member’s network, and the two levels have the same results.
Logistic regression, naive Bayes, neural network (NN), and support vector machine (SVM) algorithms are used in the second investigation. Logistic regression is used to generate the models. Eight models are created. Their robustness is supported by statistical tests of the individual predictors (Wald chi-square statistic), goodness-of-fit statistics (Cox and Snell R² and Nagelkerke R²), validation of predicted probabilities (classification table with its derived sensitivity, specificity, false positive, false negative, and overall percentage rates), and a comparison of the AUC for the models’ predicted probabilities with the AUC of the best predictors and of the NN, naive Bayes, and SVM algorithms.
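As an illustration of the AUC comparison described in the abstract, the following is a minimal sketch, not the dissertation's actual analysis. It scores each candidate determinant by how well its raw value separates members who later gained influence from those who did not. The field names, the `gained_influence` label, and the toy values are all hypothetical and are not drawn from the Yelp dataset.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    This equals the probability that a randomly chosen positive case has a
    higher score than a randomly chosen negative case; ties count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one row per member. gained_influence = 1 stands in
# for "friends (or votes) increased between two time periods" (assumed coding).
members = [
    {"friends_count": 2, "votes_count": 5, "review_count": 1, "gained_influence": 0},
    {"friends_count": 10, "votes_count": 3, "review_count": 8, "gained_influence": 1},
    {"friends_count": 4, "votes_count": 9, "review_count": 2, "gained_influence": 0},
    {"friends_count": 7, "votes_count": 12, "review_count": 6, "gained_influence": 1},
    {"friends_count": 1, "votes_count": 1, "review_count": 3, "gained_influence": 0},
    {"friends_count": 3, "votes_count": 7, "review_count": 9, "gained_influence": 1},
]

labels = [m["gained_influence"] for m in members]
determinants = ["friends_count", "votes_count", "review_count"]

# AUC of each determinant used on its own as a score for future influence.
results = {d: auc([m[d] for m in members], labels) for d in determinants}

# The determinant with the highest AUC is the best single predictor here.
best = max(results, key=results.get)

for name, value in results.items():
    print(f"{name}: AUC = {value:.3f}")
print("best determinant on this toy data:", best)
```

On real data, the same comparison would be repeated for each of the four tenure groups and for each level of measurement (member, member's network, and both combined).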