BlogVox: Separating Blog Wheat from Blog Chaff

dc.contributor.author: Java, Akshay
dc.contributor.author: Kolari, Pranam
dc.contributor.author: Finin, Tim
dc.contributor.author: Mayfield, James
dc.contributor.author: Joshi, Anupam
dc.contributor.author: Martineau, Justin
dc.date.accessioned: 2018-11-29T19:28:01Z
dc.date.available: 2018-11-29T19:28:01Z
dc.date.issued: 2007-01-07
dc.description: Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007) [en_US]
dc.description.abstract: Blog posts are often informally written, poorly structured, rife with spelling and grammatical errors, and feature non-traditional content. These characteristics make them difficult to process with standard language analysis tools. Performing linguistic analysis on blogs is plagued by two additional problems: (i) the presence of spam blogs and spam comments and (ii) extraneous non-content including blog-rolls, link-rolls, advertisements and sidebars. We describe techniques designed to eliminate noisy blog data developed as part of the BlogVox system - a blog analytics engine we developed for the 2006 TREC Blog Track. The findings in this paper underscore the importance of removing spurious content from blog collections. [en_US]
dc.description.sponsorship: Partial support was provided by an IBM Fellowship and by NSF awards ITR-IIS-0326460 and ITR-IDM-0219649. [en_US]
dc.description.uri: https://ebiquity.umbc.edu/paper/html/id/326/BlogVox-Separating-Blog-Wheat-from-Blog-Chaff [en_US]
dc.format.extent: 7 pages [en_US]
dc.genre: conference papers and proceedings preprints [en_US]
dc.identifier: doi:10.13016/M2Q23R42V
dc.identifier.uri: http://hdl.handle.net/11603/12135
dc.language.iso: en_US [en_US]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject: BlogVox [en_US]
dc.subject: Blog [en_US]
dc.subject: social media [en_US]
dc.subject: spam [en_US]
dc.subject: UMBC Ebiquity Research Group [en_US]
dc.title: BlogVox: Separating Blog Wheat from Blog Chaff [en_US]
dc.type: Text [en_US]
