Characteristics of HTML in the Deep Web

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Subjects

Deep Web
HTML
web crawling

Abstract

This paper explores the HTML characteristics of the deep web by gathering HTML tag frequencies from web pages collected with three different web crawling techniques. The first technique used the most popular websites listed by Alexa as seeds for the web crawler and randomly selected a sample of web pages to include in the statistics. The second technique gathered web pages by randomly generating shortened URLs and visiting the pages that those URLs redirected to. The third technique traversed the deep web by going through .onion websites and domains reached by randomly generating IP addresses. The statistics gathered with these three techniques are compared in this paper.
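As a rough illustration of what gathering HTML tag frequencies can look like, the sketch below counts opening tags in a single page using only the Python standard library. It is not the paper's implementation: the names TagCounter, tag_frequencies, and random_public_ipv4 are hypothetical, and the random-address helper only assumes that "randomly generating an IP" means drawing a 32-bit IPv4 address and keeping it if it is globally routable.

```python
import ipaddress
import random
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags (including self-closing ones) in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        self.counts[tag] += 1


def tag_frequencies(html_text):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def random_public_ipv4(rng):
    """Hypothetical helper: draw random IPv4 addresses until one is globally routable."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:
            return addr


if __name__ == "__main__":
    page = "<html><body><p>one</p><P>two</P><br/><a href='#'>link</a></body></html>"
    for tag, count in tag_frequencies(page).most_common():
        print(tag, count)
    print(random_public_ipv4(random.Random()))
```

Per-page counters can be summed with Counter.update to build aggregate statistics for each crawl, which is one plausible way to compare the three samples.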
