Browsing by Subject "Blog"
Blog Track Open Task: Spam Blog Classification (2006-11-14)
Kolari, Pranam; Java, Akshay; Finin, Tim; Mayfield, James; Joshi, Anupam; Martineau, Justin
Spam blogs or Splogs are blogs created for the sole purpose of hosting ads, promoting affiliate sites and getting new content indexed, with auto-generated or plagiarized content from other sources. Spammers equipped with readily available splog creation software inundate the blogosphere both at ping servers, and at systems that index and analyze blogs. Our own studies estimate these numbers to be around 75% at ping servers and 20% at popular blog search engines. In this open submission we hence propose Spam Blog Classification as a new task in the Blog Track. Splogs are a specific instance of the more general spam web-pages. While offline graph based mechanisms like TrustRank are quite effective and sufficient for the Web, the blogosphere demands new techniques. The quality of blog search engines is judged not just by their reach, but also by their ability to index recent (non-spam) posts. This requires that fast online splog detection/filtering be used prior to indexing new content, followed by offline techniques that exploit link graph anomalies. The nature of this problem makes splog detection challenging. This open task submission underscores the seriousness of the splog problem in the TREC 2006 collection, details how it impacts the primary task of Opinion Identification, and proposes multiple assessment and evaluation approaches for a Spam Blog Classification task in Blog Track 2007.

BlogVox: Separating Blog Wheat from Blog Chaff (2007-01-07)
Java, Akshay; Kolari, Pranam; Finin, Tim; Mayfield, James; Joshi, Anupam; Martineau, Justin
Blog posts are often informally written, poorly structured, rife with spelling and grammatical errors, and feature non-traditional content. These characteristics make them difficult to process with standard language analysis tools.
Performing linguistic analysis on blogs is plagued by two additional problems: (i) the presence of spam blogs and spam comments and (ii) extraneous non-content including blog-rolls, link-rolls, advertisements and sidebars. We describe techniques designed to eliminate noisy blog data developed as part of the BlogVox system - a blog analytics engine we developed for the 2006 TREC Blog Track. The findings in this paper underscore the importance of removing spurious content from blog collections.