BlogVox: Separating Blog Wheat from Blog Chaff

Authors: Java, Akshay; Kolari, Pranam; Finin, Tim; Mayfield, James; Joshi, Anupam; Martineau, Justin
Published in: Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007)
Date: 2007-01-07
Group: UMBC Ebiquity Research Group
Keywords: BlogVox; blog; social media; spam
Language: en
Type: Text
Length: 7 pages
Handle: http://hdl.handle.net/11603/12135
Rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Abstract

Blog posts are often informally written, poorly structured, rife with spelling and grammatical errors, and feature non-traditional content. These characteristics make them difficult to process with standard language analysis tools. Performing linguistic analysis on blogs is plagued by two additional problems: (i) the presence of spam blogs and spam comments and (ii) extraneous non-content including blog-rolls, link-rolls, advertisements and sidebars. We describe techniques designed to eliminate noisy blog data developed as part of the BlogVox system, a blog analytics engine we developed for the 2006 TREC Blog Track. The findings in this paper underscore the importance of removing spurious content from blog collections.