Characteristics of HTML in the Deep Web

Author/Creator

Author/Creator ORCID

Date

2015-01-01

Type of Work

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu.
Distribution Rights granted to UMBC by the author.

Abstract

This paper explores the HTML characteristics of the deep web by gathering HTML tag frequencies from web pages collected with three different web crawling techniques. The first technique seeded the crawler with the most popular websites listed by Alexa and randomly selected a sample of the crawled pages to include in the statistics. The second technique gathered pages by randomly generating shortened URLs and visiting the pages those URLs redirected to. The third technique traversed the deep web, going through .onion websites and domains reached by randomly generating an IP address. The statistics from these three techniques are gathered and compared in this paper.
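
As an illustration of what gathering HTML tag frequencies can look like, the following is a minimal sketch using only the Python standard library. It is not the paper's implementation: the names `TagCounter`, `tag_frequencies`, and `relative_frequencies` are hypothetical, and the choice to count opening tags only is an assumption made for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen in an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        # Self-closing tags such as <br/> also arrive here, because the
        # default handle_startendtag delegates to handle_starttag.
        self.counts[tag] += 1


def tag_frequencies(html):
    """Return a Counter mapping tag name to number of occurrences in one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def relative_frequencies(counts):
    """Convert raw counts into each tag's share of all tags counted."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


if __name__ == "__main__":
    # Two made-up pages standing in for crawled content.
    pages = [
        "<html><body><p>a</p><p>b</p><br/></body></html>",
        "<html><body><DIV><a href='#'>x</a></DIV></body></html>",
    ]
    combined = Counter()
    for page in pages:
        combined += tag_frequencies(page)
    for tag, share in sorted(relative_frequencies(combined).items()):
        print(f"{tag}: {share:.2f}")
```

Summing the per-page counters gives one aggregate distribution per crawl, so the distributions from the three crawling techniques could then be compared tag by tag.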