BlogVox: Separating Blog Wheat from Blog Chaff
dc.contributor.author | Java, Akshay | |
dc.contributor.author | Kolari, Pranam | |
dc.contributor.author | Finin, Tim | |
dc.contributor.author | Mayfield, James | |
dc.contributor.author | Joshi, Anupam | |
dc.contributor.author | Martineau, Justin | |
dc.date.accessioned | 2018-11-29T19:28:01Z | |
dc.date.available | 2018-11-29T19:28:01Z | |
dc.date.issued | 2007-01-07 | |
dc.description | Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007) | en_US |
dc.description.abstract | Blog posts are often informally written, poorly structured, rife with spelling and grammatical errors, and feature non-traditional content. These characteristics make them difficult to process with standard language analysis tools. Performing linguistic analysis on blogs is plagued by two additional problems: (i) the presence of spam blogs and spam comments and (ii) extraneous non-content including blog-rolls, link-rolls, advertisements and sidebars. We describe techniques designed to eliminate noisy blog data developed as part of the BlogVox system - a blog analytics engine we developed for the 2006 TREC Blog Track. The findings in this paper underscore the importance of removing spurious content from blog collections. | en_US |
dc.description.sponsorship | Partial support was provided by an IBM Fellowship and by NSF awards ITR-IIS-0326460 and ITR-IDM-0219649. | en_US |
dc.description.uri | https://ebiquity.umbc.edu/paper/html/id/326/BlogVox-Separating-Blog-Wheat-from-Blog-Chaff | en_US |
dc.format.extent | 7 pages | en_US |
dc.genre | conference papers and proceedings preprints | en_US |
dc.identifier | doi:10.13016/M2Q23R42V | |
dc.identifier.uri | http://hdl.handle.net/11603/12135 | |
dc.language.iso | en_US | en_US |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department Collection | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
dc.subject | BlogVox | en_US |
dc.subject | Blog | en_US |
dc.subject | social media | en_US |
dc.subject | spam | en_US |
dc.subject | UMBC Ebiquity Research Group | en_US |
dc.title | BlogVox: Separating Blog Wheat from Blog Chaff | en_US |
dc.type | Text | en_US |