Statistical Techniques For Language Recognition: An Empirical Study Using Real And Simulated English

Date

2010-06-04

Citation of Original Publication

Ravi Ganesan & Alan T. Sherman (1994) STATISTICAL TECHNIQUES FOR LANGUAGE RECOGNITION: AN EMPIRICAL STUDY USING REAL AND SIMULATED ENGLISH, CRYPTOLOGIA, 18:4, 289-331, DOI: 10.1080/0161-119491882919

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
“This is an Accepted Manuscript of an article published by Taylor & Francis in CRYPTOLOGIA, on 04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119491882919”

Abstract

Computer experiments compare the effectiveness of five test statistics at recognizing and distinguishing several types of real and simulated English strings. These experiments measure the statistical power and robustness of the test statistics X², ML, IND, S, and IC when applied to samples of everyday American English from the Brown Corpus and Wall Street Journal and to simulated English generated from 1st-order Markov models based on these samples. An empirical approach is needed because the asymptotic theory of statistical inference on Markov chains does not apply to short strings drawn from natural language. Here, X² is the chi-squared test statistic; ML is a likelihood ratio test for recognizing a known language; IND is a likelihood ratio test for distinguishing unknown 0th-order noise from unknown 1st-order language; S is a log-likelihood function that is a most-powerful test for distinguishing a known language from uniform noise; and IC is the index of coincidence. The test languages comprise four types of real English, two types of simulated 1st-order English, and three types of noise. Two experiments characterize the distributions of these test statistics when applied to nine test languages, presented as strings of different lengths and contaminated with various amounts of noise. Experiment 1 varies the length of the string from 2 to 2¹⁷ characters. Experiment 2 adds uniform noise to samples of three fixed lengths (2⁴, 2⁷, 2¹⁰), with the amount of added noise ranging from 0% to 100%. These experiments assess the performance of the test statistics under realistic cryptographic constraints. Using graphs and tables of observed statistical power, we compare the effectiveness of the test statistics at distinguishing various pairs of languages at several critical levels. Although no statistic dominated all others for all critical levels and string lengths, each test performed well at its designated task. For distinguishing a known type of English from uniform noise at critical levels 0.1 through 0.0001, X² attained the highest power, with ML and S also performing well. For distinguishing uniform noise from a known type of English at the same critical levels, ML had the overall best performance, with IC, X², S, and IND also performing well. And for each of these tasks under noisy conditions, ML attained the highest power. In addition, through histograms we describe the actual distribution of each statistic on various language types. These detailed results, which show relationships between power, critical level, and string length, will help cryptanalysts and others apply statistical methods to practical language-recognition problems.
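
To make two of the statistics named above concrete, the following is a minimal illustrative sketch in Python (not the authors' implementation). It computes the chi-squared statistic X² of a string's letter counts against an assumed table of approximate English single-letter frequencies, and the index of coincidence IC. The frequency table and example strings are assumptions chosen for illustration; the paper's ML, IND, and S statistics additionally require 1st-order Markov models fitted from corpora such as the Brown Corpus and Wall Street Journal samples.

    from collections import Counter

    # Assumed approximate relative frequencies of single letters in English
    # text (illustration only; the paper fits its models from the Brown
    # Corpus and Wall Street Journal samples).
    ENGLISH_FREQ = {
        'a': 0.0817, 'b': 0.0150, 'c': 0.0278, 'd': 0.0425, 'e': 0.1270,
        'f': 0.0223, 'g': 0.0202, 'h': 0.0609, 'i': 0.0697, 'j': 0.0015,
        'k': 0.0077, 'l': 0.0403, 'm': 0.0241, 'n': 0.0675, 'o': 0.0751,
        'p': 0.0193, 'q': 0.0010, 'r': 0.0599, 's': 0.0633, 't': 0.0906,
        'u': 0.0276, 'v': 0.0098, 'w': 0.0236, 'x': 0.0015, 'y': 0.0197,
        'z': 0.0007,
    }

    def letter_counts(text):
        """Count the letters a-z in text, ignoring case and non-letters."""
        return Counter(c for c in text.lower() if c in ENGLISH_FREQ)

    def chi_squared(text):
        """X²: sum over letters of (observed - expected)² / expected counts."""
        counts = letter_counts(text)
        n = sum(counts.values())
        if n == 0:
            return 0.0
        return sum((counts.get(letter, 0) - n * p) ** 2 / (n * p)
                   for letter, p in ENGLISH_FREQ.items())

    def index_of_coincidence(text):
        """IC: probability that two letters drawn without replacement match."""
        counts = letter_counts(text)
        n = sum(counts.values())
        if n < 2:
            return 0.0
        return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

    if __name__ == "__main__":
        samples = {
            "english": "statistical techniques for language recognition",
            "noise":   "qzxjkvwqpzzxjqkvwzzqxjkv",
        }
        for label, s in samples.items():
            print("%-8s  X^2 = %8.2f   IC = %.4f"
                  % (label, chi_squared(s), index_of_coincidence(s)))

Roughly, English-like text yields an IC near 0.066 and a small X², while uniform noise pushes IC toward 1/26 ≈ 0.038 and inflates X² as the string grows, which is the kind of separation the paper quantifies as statistical power.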