Statistical Techniques For Language Recognition: An Empirical Study Using Real And Simulated English

dc.contributor.authorGanesan, Ravi
dc.contributor.authorSherman, Alan T.
dc.date.accessioned2019-02-21T16:00:26Z
dc.date.available2019-02-21T16:00:26Z
dc.date.issued2010-06-04
dc.description.abstractComputer experiments compare the effectiveness of five test statistics at recognizing and distinguishing several types of real and simulated English strings. These experiments measure the statistical power and robustness of the test statistics X², ML, IND, S, and IC when applied to samples of everyday American English from the Brown Corpus and Wall Street Journal and to simulated English generated from 1st-order Markov models based on these samples. An empirical approach is needed because the asymptotic theory of statistical inference on Markov chains does not apply to short strings drawn from natural language. Here, X² is the chi-squared test statistic; ML is a likelihood ratio test for recognizing a known language; IND is a likelihood ratio test for distinguishing unknown 0th-order noise from unknown 1st-order language; S is a log-likelihood function that is a most-powerful test for distinguishing a known language from uniform noise; and IC is the index of coincidence. The test languages comprise four types of real English, two types of simulated 1st-order English, and three types of noise. Two experiments characterize the distributions of these test statistics when applied to nine test languages, presented as strings of different lengths and contaminated with various amounts of noise. Experiment 1 varies the length of the string from 2 to 2¹⁷ characters. Experiment 2 adds uniform noise to samples of three fixed lengths (2⁴, 2⁷, 2¹⁰), with the amount of added noise ranging from 0% to 100%. These experiments assess the performance of the test statistics under realistic cryptographic constraints. Using graphs and tables of observed statistical power, we compare the effectiveness of the test statistics at distinguishing various pairs of languages at several critical levels. Although no statistic dominated all others for all critical levels and string lengths, each test performed well at its designated task.
For distinguishing a known type of English from uniform noise at critical levels 0.1 through 0.0001, X² attained the highest power, with ML and S also performing well. For distinguishing uniform noise from a known type of English at the same critical levels, ML had the overall best performance, with IC, X², S, and IND also performing well. And for each of these tasks under noisy conditions, ML attained the highest power. In addition, through histograms we describe the actual distribution of each statistic on various language types. These detailed results, which show relationships between power, critical level, and string length, will help cryptanalysts and others apply statistical methods to practical language-recognition problems.en_US
dc.description.urihttps://www.tandfonline.com/doi/pdf/10.1080/0161-119491882919?needAccess=trueen_US
dc.format.extent21 pagesen_US
dc.genrejournal articles postprintsen_US
dc.identifierdoi:10.13016/m2y6pw-pf7r
dc.identifier.citationRavi Ganesan & Alan T. Sherman (1994) STATISTICAL TECHNIQUES FOR LANGUAGE RECOGNITION: AN EMPIRICAL STUDY USING REAL AND SIMULATED ENGLISH, CRYPTOLOGIA, 18:4, 289-331, DOI: 10.1080/0161-119491882919en_US
dc.identifier.urihttp://hdl.handle.net/11603/12834
dc.language.isoen_USen_US
dc.publisherTaylor & Francis
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Center for Research and Exploration in Space Sciences & Technology II (CRSST II)
dc.relation.ispartofUMBC Faculty Collection
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department
dc.rightsThis item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights“This is an Accepted Manuscript of an article published by Taylor & Francis in CRYPTOLOGIA, on 04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119491882919”
dc.subjectautomatic language recognitionen_US
dc.subjectstatistical approaches to language recognitionen_US
dc.subjectcryptographyen_US
dc.subjectcryptanalysisen_US
dc.titleStatistical Techniques For Language Recognition: An Empirical Study Using Real And Simulated Englishen_US
dc.typeTexten_US
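
Two of the statistics named in the abstract, the index of coincidence (IC) and the chi-squared statistic (X²), can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact definitions used in the paper, so the code assumes the common single-letter textbook forms over a 26-letter alphabet. The names `letter_counts`, `index_of_coincidence`, `chi_squared`, and `UNIFORM` are hypothetical and introduced only for illustration.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase  # assumption: 26-letter alphabet, case folded


def letter_counts(text):
    """Count occurrences of a-z, ignoring case and all other characters."""
    return Counter(c for c in text.lower() if c in ALPHABET)


def index_of_coincidence(text):
    """Probability that two letters drawn without replacement are equal."""
    counts = letter_counts(text)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two letters")
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def chi_squared(text, expected_probs):
    """Pearson chi-squared of observed letter counts against a reference
    distribution given as a dict mapping letter -> probability."""
    counts = letter_counts(text)
    n = sum(counts.values())
    return sum(
        (counts[c] - n * p) ** 2 / (n * p)
        for c, p in expected_probs.items()
        if p > 0
    )


# Reference distribution for uniform noise (one of the noise types in the abstract).
UNIFORM = {c: 1 / 26 for c in ALPHABET}
```

For uniform noise the IC is close to 1/26 on long strings, while English text gives a noticeably higher value. To test against a known language rather than noise, `expected_probs` would instead hold letter frequencies estimated from a corpus; those frequencies are not reproduced here.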

Files

Original bundle
Name:
ShermanCryptologia94.pdf
Size:
1012.45 KB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
2.56 KB
Format:
Item-specific license agreed upon to submission